# Using Parallelism with Machine Learning: The Housing Prices Competition 

## Description of the competition

- The Housing Prices Competition dataset describes residential homes in Ames, Iowa, using both quantitative and categorical variables such as the size of the property, the number of rooms, the year built, and the neighborhood quality.
- It includes a set of 79 explanatory variables describing almost every aspect of the houses, allowing for in-depth analysis.
- *The primary goal* of the competition is to predict **the final price of each home**. In this lab we will use *Random Forests*.
- Models are evaluated on the Root Mean Squared Error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price, so that errors on expensive and cheap houses affect the score equally.
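
As a minimal sketch of the competition metric described above, the helper below computes the RMSE between log prices using only the standard library. The function name `log_rmse` and the example prices are assumptions made for illustration. Note that the cells later in this lab report the plain RMSE in dollars, not this log-based score.

```python
from math import log, sqrt

def log_rmse(y_true, y_pred):
    """RMSE between the logarithms of observed and predicted prices."""
    squared_errors = [(log(p) - log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# A 10% overestimate costs the same on a cheap house and on an expensive one
cheap = log_rmse([100_000], [110_000])
expensive = log_rmse([1_000_000], [1_100_000])
print(cheap, expensive)
```

Because the error is measured on a log scale, only the ratio between the predicted and the observed price matters.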

### File descriptions
- *train.csv*: the training set used to train the model.
- *test.csv*: the test set used to compute the performance of the model.
- *data_description.txt*: full description of each column.

### Useful data fields

Here's a brief version of what you'll find in the data description file.

- *SalePrice*: the property's sale price in dollars. This is the target variable that you're trying to predict.
- *MSSubClass*: The building class
- *MSZoning*: The general zoning classification

The dataset is accessible here: https://www.kaggle.com/code/dansbecker/random-forests/tutorial

## Read and prepare the data
*If you're curious about this the professor can explain it for you*.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the training dataset
file_path = '../data/train.csv'
train_data = pd.read_csv(file_path, index_col="Id")

# Columns to be deleted
columns_to_delete = ['MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']

# Delete the specified columns
train_data_cleaned = train_data.drop(columns=columns_to_delete)

# Define the input features (X) and the output (y)
X = train_data_cleaned.drop('SalePrice', axis=1)
y = train_data_cleaned['SalePrice']

# Identify the categorical columns in X
categorical_columns = X.select_dtypes(include=['object']).columns

# Initialize a LabelEncoder for each categorical column
label_encoders = {column: LabelEncoder() for column in categorical_columns}

# Apply Label Encoding to each categorical column
for column in categorical_columns:
    X[column] = label_encoders[column].fit_transform(X[column])

# Display the first few rows of X to confirm the encoding
print(X.head())


    MSSubClass  MSZoning  LotFrontage  LotArea  Street  LotShape  LandContour  \
Id                                                                              
1           60         3         65.0     8450       1         3            3   
2           20         3         80.0     9600       1         3            3   
3           60         3         68.0    11250       1         0            3   
4           70         3         60.0     9550       1         0            3   
5           60         3         84.0    14260       1         0            3   

    Utilities  LotConfig  LandSlope  ...  GarageQual  GarageCond  PavedDrive  \
Id                                   ...                                       
1           0          4          0  ...           4           4           2   
2           0          2          0  ...           4           4           2   
3           0          4          0  ...           4           4           2   
4           0          0        

## Split the data into training and validation sets

In [2]:
from sklearn.model_selection import train_test_split

# Split the dataset (X, y) into training and validation sets with a 70% - 30% split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=42)

# Fill NaN values in X_train and X_val with the median of the respective columns
X_train_filled = X_train.fillna(X_train.median())
X_val_filled = X_val.fillna(X_val.median())

(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

((1022, 70), (438, 70), (1022,), (438,))

## First RandomForest Model
This is a first trial that uses the default hyperparameters.

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

# Create a Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Train the model on the training data
rf_model.fit(X_train_filled, y_train)

# Make predictions on the validation data
y_val_pred_filled = rf_model.predict(X_val_filled)

# Calculate the RMSE on the validation data
rmse_filled = sqrt(mean_squared_error(y_val, y_val_pred_filled))

# Print the RMSE
print(f'RMSE on the validation data: {rmse_filled}')

RMSE on the validation data: 26057.941851126383


### Parameters of Random Forest Model
The three parameters that typically have the most impact on the performance of a Random Forest model are:

- *n_estimators*: This parameter specifies the number of trees in the forest. Generally, a higher number of trees increases the performance and makes the predictions more stable, but it also makes the computation slower. Selecting the right number of trees requires balancing between performance and computational efficiency.

- *max_features*: This parameter defines the maximum number of features that each tree considers when looking for the best split. There are several options available for this parameter:

    - *sqrt*: This is commonly used and means that the maximum number of features used at each split is the square root of the total number of features.
    - *log2*: This is another typical option, meaning the log base 2 of the feature count is used.
    - *A specific integer or float*: You can specify an exact number or a proportion of the total.

- *max_depth*: This parameter specifies the maximum depth of each tree. Deeper trees can model more complex patterns, but they also risk overfitting. Limiting the depth of trees can improve the model's generalization and reduce overfitting. It's often useful to set this parameter to a finite value, especially when dealing with a large number of features.
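
To make the *max_features* options concrete, the sketch below converts each option into a number of features per split for the 70 input columns used in this lab. The helper `features_per_split` is hypothetical and only mirrors our understanding of how scikit-learn resolves these values (truncating to an integer, with a minimum of 1), so treat it as an assumption and not as the library's code.

```python
from math import log2, sqrt

n_features = 70  # number of input columns in X after preprocessing

def features_per_split(max_features, n_features):
    """Hypothetical helper: number of features tried at each split."""
    if max_features is None:
        return n_features                       # use all features
    if max_features == 'sqrt':
        return max(1, int(sqrt(n_features)))    # square root, truncated
    if max_features == 'log2':
        return max(1, int(log2(n_features)))    # log base 2, truncated
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # proportion of the total
    return int(max_features)                    # exact number

for option in ['sqrt', 'log2', None, 0.5, 10]:
    print(option, features_per_split(option, n_features))
```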

## Finding the best parameters sequentially
<span style="color:red">This code only looks for the best parameters. It is NOT best modeling practice.</span>


### Define the parameter ranges

In [4]:
n_estimators_range = [10, 25, 50, 100, 200, 300, 400]
max_features_range = ['sqrt', 'log2', None]  # None means using all features
max_depth_range = [1, 2, 5, 10, 20, None]  # None means no limit
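
As a quick check of how large this search is, the sketch below enumerates the grid with `itertools.product`, which visits the combinations in the same order as three nested loops. The name `parameter_grid` is an assumption introduced for this example.

```python
from itertools import product

n_estimators_range = [10, 25, 50, 100, 200, 300, 400]
max_features_range = ['sqrt', 'log2', None]  # None means using all features
max_depth_range = [1, 2, 5, 10, 20, None]  # None means no limit

# Every (n_estimators, max_features, max_depth) combination, in loop order
parameter_grid = list(product(n_estimators_range, max_features_range, max_depth_range))

print(f"Number of combinations: {len(parameter_grid)}")
print(parameter_grid[:3])
```

Each combination requires training one full model, so the size of the grid drives the total running time.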

### Running the program

In [5]:
import time
from sklearn.metrics import mean_absolute_percentage_error
from tqdm import tqdm

# Starting the timer
start_time = time.time()

# Initialize variables to store the best model and its RMSE and parameters
best_rmse = float('inf')
best_mape = float('inf')
best_model = None
best_parameters = {}

# Loop over all possible combinations of parameters
for n_estimators in tqdm(n_estimators_range):
    for max_features in max_features_range:
        for max_depth in max_depth_range:
            # Create and train the Random Forest model
            rf_model = RandomForestRegressor(
                n_estimators=n_estimators,
                max_features=max_features,
                max_depth=max_depth,
                random_state=42
            )
            rf_model.fit(X_train_filled, y_train)
            
            # Make predictions and compute RMSE
            y_val_pred = rf_model.predict(X_val_filled)
            rmse = sqrt(mean_squared_error(y_val, y_val_pred))
            # Compute MAPE
            mape = mean_absolute_percentage_error(y_val, y_val_pred) * 100
            print(f"The parameters: {n_estimators}, {max_features}, {max_depth}. RMSE: {rmse}, MAPE: {mape}%")
            # If the model is better than the current best, update the best model and its parameters
            if rmse < best_rmse:
                best_rmse = rmse
                best_mape = mape
                best_model = rf_model
                best_parameters = {
                    'n_estimators': n_estimators,
                    'max_features': max_features,
                    'max_depth': max_depth
                }
print(f"The best parameters {best_parameters} for RMSE = {best_rmse}, MAPE: {best_mape}%")
end_time = time.time()
sequential_time = end_time - start_time
print(f"The sequential execution time is {sequential_time}")

  0%|                                                                                             | 0/7 [00:00<?, ?it/s]

The parameters: 10, sqrt, 1. RMSE: 62936.129262244904, MAPE: 26.860734529882123%
The parameters: 10, sqrt, 2. RMSE: 51806.105176310826, MAPE: 20.52147680573829%
The parameters: 10, sqrt, 5. RMSE: 34018.6946431472, MAPE: 13.846204059816515%
The parameters: 10, sqrt, 10. RMSE: 30158.362616471855, MAPE: 11.1847373574493%
The parameters: 10, sqrt, 20. RMSE: 30593.76147467334, MAPE: 11.148658516638926%
The parameters: 10, sqrt, None. RMSE: 28608.440009540453, MAPE: 11.124252485645316%
The parameters: 10, log2, 1. RMSE: 64442.107292878405, MAPE: 27.58882909655716%
The parameters: 10, log2, 2. RMSE: 51396.530250771764, MAPE: 21.498982900746558%
The parameters: 10, log2, 5. RMSE: 34956.394381827566, MAPE: 13.945831547616713%
The parameters: 10, log2, 10. RMSE: 30136.927641357106, MAPE: 11.160840785213171%
The parameters: 10, log2, 20. RMSE: 30841.310557247474, MAPE: 11.240205002153882%
The parameters: 10, log2, None. RMSE: 29000.351214559596, MAPE: 10.98098439114081%
The parameters: 10, None, 

 14%|████████████▏                                                                        | 1/7 [00:00<00:03,  1.57it/s]

The parameters: 10, None, None. RMSE: 28062.104755431203, MAPE: 10.28772353916743%
The parameters: 25, sqrt, 1. RMSE: 60720.06723428002, MAPE: 25.2463033064465%
The parameters: 25, sqrt, 2. RMSE: 48999.969822412146, MAPE: 19.141115612087532%
The parameters: 25, sqrt, 5. RMSE: 32317.961786662752, MAPE: 13.114437916762203%
The parameters: 25, sqrt, 10. RMSE: 28259.543521821943, MAPE: 10.47369873275887%
The parameters: 25, sqrt, 20. RMSE: 28601.996953881055, MAPE: 10.651671775420226%
The parameters: 25, sqrt, None. RMSE: 28437.477716170713, MAPE: 10.650717147533856%
The parameters: 25, log2, 1. RMSE: 62175.23336548907, MAPE: 26.103334605968907%
The parameters: 25, log2, 2. RMSE: 49454.96222058965, MAPE: 20.009887677109557%
The parameters: 25, log2, 5. RMSE: 34640.741335575985, MAPE: 13.585957690083541%
The parameters: 25, log2, 10. RMSE: 29797.647023472127, MAPE: 10.802129162615948%
The parameters: 25, log2, 20. RMSE: 28152.423144151206, MAPE: 10.113221227840645%
The parameters: 25, log2,

 29%|████████████████████████▎                                                            | 2/7 [00:02<00:05,  1.11s/it]

The parameters: 25, None, None. RMSE: 27011.528848190042, MAPE: 10.057567711051357%
The parameters: 50, sqrt, 1. RMSE: 60346.5737195233, MAPE: 25.162121091064353%
The parameters: 50, sqrt, 2. RMSE: 48952.98192411672, MAPE: 19.219570888426375%
The parameters: 50, sqrt, 5. RMSE: 32807.733869677046, MAPE: 12.765423020132053%
The parameters: 50, sqrt, 10. RMSE: 28193.908384996157, MAPE: 10.30283712157863%
The parameters: 50, sqrt, 20. RMSE: 28322.13190140527, MAPE: 10.266831234466041%
The parameters: 50, sqrt, None. RMSE: 28253.30753002766, MAPE: 10.301610941463089%
The parameters: 50, log2, 1. RMSE: 61769.11746990103, MAPE: 26.024888854218215%
The parameters: 50, log2, 2. RMSE: 49973.37697266408, MAPE: 19.855069340876533%
The parameters: 50, log2, 5. RMSE: 35272.23782970266, MAPE: 13.259237055635129%
The parameters: 50, log2, 10. RMSE: 29337.548793235695, MAPE: 10.69972715022594%
The parameters: 50, log2, 20. RMSE: 28219.379219234233, MAPE: 10.054841695082303%
The parameters: 50, log2, No

 43%|████████████████████████████████████▍                                                | 3/7 [00:04<00:07,  1.88s/it]

The parameters: 50, None, None. RMSE: 26659.206888314602, MAPE: 9.950072723845874%
The parameters: 100, sqrt, 1. RMSE: 60511.413390408285, MAPE: 24.970655834062548%
The parameters: 100, sqrt, 2. RMSE: 48423.20979308409, MAPE: 19.124218836706603%
The parameters: 100, sqrt, 5. RMSE: 32585.44117019146, MAPE: 12.510275320688514%
The parameters: 100, sqrt, 10. RMSE: 27940.66308571875, MAPE: 10.232467246501514%
The parameters: 100, sqrt, 20. RMSE: 27530.724415300712, MAPE: 10.125866760621907%
The parameters: 100, sqrt, None. RMSE: 27690.85111196373, MAPE: 10.111048513592108%
The parameters: 100, log2, 1. RMSE: 62166.92632864128, MAPE: 25.90574767935297%
The parameters: 100, log2, 2. RMSE: 49946.811096505524, MAPE: 19.85047840173841%
The parameters: 100, log2, 5. RMSE: 34438.66151846642, MAPE: 13.17388786526951%
The parameters: 100, log2, 10. RMSE: 28459.288873741767, MAPE: 10.547559689956204%
The parameters: 100, log2, 20. RMSE: 27471.51821338425, MAPE: 9.805906069975684%
The parameters: 100

 57%|████████████████████████████████████████████████▌                                    | 4/7 [00:10<00:09,  3.31s/it]

The parameters: 100, None, None. RMSE: 26057.941851126383, MAPE: 9.868196740754167%
The parameters: 200, sqrt, 1. RMSE: 60592.20080413971, MAPE: 25.348413331204604%
The parameters: 200, sqrt, 2. RMSE: 48456.200987649645, MAPE: 19.31834333556386%
The parameters: 200, sqrt, 5. RMSE: 32921.85786148226, MAPE: 12.593595077314337%
The parameters: 200, sqrt, 10. RMSE: 28088.902620554447, MAPE: 10.205458456989026%
The parameters: 200, sqrt, 20. RMSE: 27532.91156851607, MAPE: 9.99674498484024%
The parameters: 200, sqrt, None. RMSE: 27430.923405573616, MAPE: 10.001128964051244%
The parameters: 200, log2, 1. RMSE: 62120.39237086007, MAPE: 26.338849401949805%
The parameters: 200, log2, 2. RMSE: 50047.12857344109, MAPE: 20.132262501400174%
The parameters: 200, log2, 5. RMSE: 34734.445288631454, MAPE: 13.310668983679474%
The parameters: 200, log2, 10. RMSE: 28692.19328330196, MAPE: 10.425408664401772%
The parameters: 200, log2, 20. RMSE: 27918.172300177175, MAPE: 9.871948875899832%
The parameters: 2

 71%|████████████████████████████████████████████████████████████▋                        | 5/7 [00:21<00:12,  6.01s/it]

The parameters: 200, None, None. RMSE: 26109.529848542985, MAPE: 9.785264413857124%
The parameters: 300, sqrt, 1. RMSE: 60929.66932059552, MAPE: 25.5233017132497%
The parameters: 300, sqrt, 2. RMSE: 48477.515732693806, MAPE: 19.45952457308847%
The parameters: 300, sqrt, 5. RMSE: 32739.920144412845, MAPE: 12.654260849343446%
The parameters: 300, sqrt, 10. RMSE: 27720.206311955073, MAPE: 10.158753220059861%
The parameters: 300, sqrt, 20. RMSE: 27459.785134152986, MAPE: 9.886952721263405%
The parameters: 300, sqrt, None. RMSE: 27394.84596466395, MAPE: 9.923924926914179%
The parameters: 300, log2, 1. RMSE: 62683.20415726862, MAPE: 26.538128720991566%
The parameters: 300, log2, 2. RMSE: 50458.67950947336, MAPE: 20.275483738507045%
The parameters: 300, log2, 5. RMSE: 34565.6574161325, MAPE: 13.300516763849002%
The parameters: 300, log2, 10. RMSE: 28541.379624482677, MAPE: 10.42483342213683%
The parameters: 300, log2, 20. RMSE: 27744.585234343336, MAPE: 9.924986717080374%
The parameters: 300,

 86%|████████████████████████████████████████████████████████████████████████▊            | 6/7 [00:37<00:09,  9.49s/it]

The parameters: 300, None, None. RMSE: 26176.10432944751, MAPE: 9.801826393115594%
The parameters: 400, sqrt, 1. RMSE: 60914.53857642404, MAPE: 25.51869058914844%
The parameters: 400, sqrt, 2. RMSE: 48669.695843852176, MAPE: 19.530745286106946%
The parameters: 400, sqrt, 5. RMSE: 32809.62783027221, MAPE: 12.71009467191943%
The parameters: 400, sqrt, 10. RMSE: 27744.18030444477, MAPE: 10.125160913453673%
The parameters: 400, sqrt, 20. RMSE: 27497.671359831445, MAPE: 9.905124417060355%
The parameters: 400, sqrt, None. RMSE: 27418.113736521427, MAPE: 9.945984648595795%
The parameters: 400, log2, 1. RMSE: 62661.54638863782, MAPE: 26.43424504101804%
The parameters: 400, log2, 2. RMSE: 50422.97022602301, MAPE: 20.23343384859187%
The parameters: 400, log2, 5. RMSE: 34496.868353168815, MAPE: 13.290480205939964%
The parameters: 400, log2, 10. RMSE: 28518.64779526882, MAPE: 10.390264584164836%
The parameters: 400, log2, 20. RMSE: 27800.35162861101, MAPE: 9.91776916154931%
The parameters: 400, lo

100%|█████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:58<00:00,  8.43s/it]

The parameters: 400, None, None. RMSE: 26362.199455330487, MAPE: 9.83203095544113%
The best parameters {'n_estimators': 100, 'max_features': None, 'max_depth': None} for RMSE = 26057.941851126383, MAPE: 9.868196740754167%
The sequential execution time is 59.00056791305542





## Training Using Threading

This program trains the Random Forest models in separate threads, one per hyperparameter combination, to try to speed up the hyperparameter tuning process.
The code performs the following steps:

- **Import Libraries**: Essential libraries such as time, threading, and components from sklearn are imported.
- **Define Parameter Ranges**: Sets of hyperparameters (n_estimators, max_features, max_depth) are defined for the model to iterate through.
- **Define Evaluation Function**: A function evaluate_model is defined to:
    - Train the Random Forest model with the given parameters.
    - Predict and calculate the Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) on the validation set.
    - Store the results (parameters and performance metrics) in a shared list.
- **Initialize Threads for Model Training**: For each combination of parameters, a new thread is started, calling evaluate_model.
- **Wait for Threads to Complete**: The main thread waits for all the model training threads to complete.
- **Find Best Parameters**: Once all threads are done, the best parameters are determined based on the lowest RMSE.
- **Results and Execution Time**: The best parameters, along with RMSE and MAPE, are printed, and the total time taken for the thread-based parallel execution is displayed.

### Importing

In [6]:
import time
from math import sqrt
from threading import Thread

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from tqdm import tqdm

### Create an evaluation function

In [7]:
from typing import Union
# Function to evaluate a model and store the metrics in a shared list
def evaluate_model(n_estimators: int,
                   max_features: Union[str, None],
                   max_depth: Union[int, None],
                   results: list,
                   index: int) -> None:
    """
    Trains a Random Forest model with the provided hyperparameters, evaluates it on a validation set,
    and stores the performance metrics and hyperparameters in a shared list.

    Parameters:
        n_estimators (int): The number of trees in the forest.
        max_features (Union[str, None]): The number of features to consider when looking for the best split; 
                                         can be 'sqrt', 'log2', or None (use all features).
        max_depth (Union[int, None]): The maximum depth of the tree; can be an integer or None (no limit).
        results (List[Optional[Tuple]]): A shared list where the function's results (hyperparameters and performance metrics) are stored.
        index (int): The position in the results list where the results of this function call should be stored.

    Returns:
        None: This function does not return a value. It modifies the `results` list in place by adding the performance metrics and hyperparameters.

    Note:
        This function is designed to be run in a separate thread for each combination of hyperparameters,
        facilitating parallel model evaluation.
    """
    
    # Create and train the Random Forest model
    rf_model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_features=max_features,
        max_depth=max_depth,
        random_state=42
    )
    rf_model.fit(X_train_filled, y_train)
    
    # Make predictions and compute RMSE
    y_val_pred = rf_model.predict(X_val_filled)
    rmse = sqrt(mean_squared_error(y_val, y_val_pred))
    # Compute MAPE
    mape = mean_absolute_percentage_error(y_val, y_val_pred) * 100
    
    # Store the results
    results[index] = (n_estimators, max_features, max_depth, rmse, mape)

### Running the threading program

In [8]:
start_time = time.time()

# Define the parameter ranges
n_estimators_range = [10, 25, 50, 100, 200, 300, 400]
max_features_range = ['sqrt', 'log2', None]  # None means using all features
max_depth_range = [1, 2, 5, 10, 20, None]  # None means no limit

# Store results from each thread
results = [None] * (len(n_estimators_range) * len(max_features_range) * len(max_depth_range))

# Create and start threads for each combination of parameters
threads = []
index = 0
for n_estimators in n_estimators_range:
    for max_features in max_features_range:
        for max_depth in max_depth_range:
            thread = Thread(target=evaluate_model, args=(n_estimators,
                                                         max_features,
                                                         max_depth,
                                                         results,
                                                         index))
            threads.append(thread)
            thread.start()
            index += 1

# Wait for all threads to finish
for thread in tqdm(threads, desc="Finishing threads: "):
    thread.join()

# Find the best parameters based on RMSE
best_params = None
best_rmse = float('inf')
best_mape = float('inf')
for n_estimators, max_features, max_depth, rmse, mape in tqdm(results, desc="Analyzing results: "):
    if rmse < best_rmse:
        best_rmse = rmse
        best_mape = mape
        best_params = {
            'n_estimators': n_estimators,
            'max_features': max_features,
            'max_depth': max_depth
        }

print(f"The best parameters {best_params} for RMSE = {best_rmse}, MAPE = {best_mape}%")
end_time = time.time()
threading_time = end_time - start_time
print(f"The thread parallel execution time is {threading_time}")

Finishing threads: 100%|██████████████████████████████████████████████████████████████| 126/126 [00:48<00:00,  2.58it/s]
Analyzing results: 100%|███████████████████████████████████████████████████████████| 126/126 [00:00<00:00, 689025.17it/s]

The best parameters {'n_estimators': 100, 'max_features': None, 'max_depth': None} for RMSE = 26057.941851126383, MAPE = 9.868196740754167%
The thread parallel execution time is 55.66719841957092
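
The cell above starts all 126 threads at once. A common alternative is a bounded pool that runs only a few workers at a time. The sketch below illustrates that pattern with `concurrent.futures.ThreadPoolExecutor`. To keep it runnable without the dataset, `score_combination` is a hypothetical stand-in that returns a made-up score in place of training a model, and `max_workers=4` is an assumption chosen to match the 4 CPUs reported later in this lab.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

n_estimators_range = [10, 25, 50, 100, 200, 300, 400]
max_features_range = ['sqrt', 'log2', None]
max_depth_range = [1, 2, 5, 10, 20, None]

def score_combination(params):
    """Hypothetical stand-in for training: returns (params, made-up score)."""
    n_estimators, max_features, max_depth = params
    # Placeholder score so the sketch runs on its own;
    # in the lab this would be the validation RMSE of a trained model.
    fake_rmse = 1.0 / n_estimators + (0.0 if max_depth is None else 1.0 / max_depth)
    return params, fake_rmse

combinations = list(product(n_estimators_range, max_features_range, max_depth_range))

# At most 4 worker threads are alive at any time; map() keeps the input order
with ThreadPoolExecutor(max_workers=4) as executor:
    scored = list(executor.map(score_combination, combinations))

best_params, best_score = min(scored, key=lambda item: item[1])
print(best_params, best_score)
```

With this pattern each worker returns its result directly, so no shared list or index bookkeeping is needed.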





## Training Using Processes

1. **Start Timing**:
    - Record the current time to calculate the total execution time once all processes complete.

2. **Prepare for Parallel Execution**:
    - Initialize empty lists to keep track of the Process objects (processes) and the file paths where each process's output will be stored (`file_paths`).

3. **Iterate Over Parameter Combinations**: 
    - Loop through each combination of hyperparameters defined by `n_estimators_range`, `max_features_range`, and `max_depth_range`.
    - For each combination, create a unique file path to store the evaluation results. This ensures that each process's output is saved separately and can be identified easily.

4. **Create and Start Processes**:
    - For every combination of parameters, instantiate a Process object from the multiprocessing module, targeting the `evaluate_model` function with the current combination of parameters and the designated file path for results.
    - Add the created Process object to the processes list for tracking and start the process using `process.start()`. This begins the model evaluation in a separate process.

5. **Wait for Process Completion**:
    - After starting all processes, loop through the processes list and call `process.join()` on each. This ensures the main program waits for all processes to finish their execution before moving forward.

6. **Collect and Clean Up Results**:
    - Iterate over the file_paths list to locate and open each results file created by the processes.
    - Load the evaluation results from each file into a Python object (e.g., a dictionary) and append it to the results list for further analysis.
    - Delete the results file to clean up the file system, ensuring no residual files remain.

7. **Analyze Results**:
    - Process the collected results to identify the best parameter combination based on the evaluation metrics, such as the lowest Root Mean Squared Error (RMSE) or Mean Absolute Percentage Error (MAPE).

8. **Report Best Parameters and Performance Metrics**:
    - Output the best performing model's parameters and its corresponding evaluation metrics to the console.

9. **Calculate Total Execution Time**:
    - Record the current time again at the end of the execution and calculate the total duration by subtracting the start time from this end time.
    - Print the total parallel execution time to provide insight into the efficiency gained through parallel processing.
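
Steps 6 and 7 can be sketched in isolation as follows. The example writes two made-up result dictionaries to JSON files in a temporary directory, reads them back, deletes the files, and picks the entry with the lowest RMSE. The values in `fake_results` are hypothetical and only stand in for what the worker processes would write.

```python
import json
import os
import tempfile

# Hypothetical results, standing in for the output of two worker processes
fake_results = [
    {'n_estimators': 10, 'max_features': 'sqrt', 'max_depth': 5, 'rmse': 3.0},
    {'n_estimators': 100, 'max_features': None, 'max_depth': None, 'rmse': 1.0},
]

loaded = []
with tempfile.TemporaryDirectory() as tmp_dir:
    file_paths = []
    # Each "process" writes its result to its own file
    for r in fake_results:
        name = f"results_{r['n_estimators']}_{r['max_features']}_{r['max_depth']}.json"
        path = os.path.join(tmp_dir, name)
        with open(path, 'w') as f:
            json.dump(r, f)
        file_paths.append(path)

    # The main program collects the results and cleans up
    for path in file_paths:
        with open(path, 'r') as f:
            loaded.append(json.load(f))
        os.remove(path)

# Select the combination with the lowest RMSE
best = min(loaded, key=lambda r: r['rmse'])
print(best)
```

JSON stores Python's `None` as `null` and restores it as `None` on loading, so parameter values such as `max_depth=None` survive the round trip through the file.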

### Importing

In [9]:
from multiprocessing import Process
import os
import json

### Rewriting the evaluation function

In [10]:
from typing import Union
# Function to evaluate a model and save the results to a file
def evaluate_model(n_estimators: int = 100, 
                   max_features: Union[str, None] = None, 
                   max_depth: Union[int, None] = None, 
                   file_path: str = 'model_results.json') -> None:
    """
    Trains a Random Forest regressor with the specified hyperparameters,
    evaluates its performance on a validation set, and saves the results to a file.

    Parameters:
        n_estimators (int): The number of trees in the forest. Default is 100.
        max_features (Union[str, None]): The number of features to consider when looking for the best split.
                                         Can be 'sqrt', 'log2', or None (use all features). Default is None.
        max_depth (Union[int, None]): The maximum depth of the trees. Default is None, meaning no limit.
        file_path (str): Path to the file where the results will be saved. Default is 'model_results.json'.

    This function does not return any value. It writes the model's performance metrics and the hyperparameters
    used for training into a JSON file specified by file_path.
    """
    
    # Create and train the Random Forest model with given hyperparameters
    rf_model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_features=max_features,
        max_depth=max_depth,
        random_state=42  # Ensures reproducible results
    )
    rf_model.fit(X_train_filled, y_train)  # Assuming X_train_filled and y_train are pre-defined
    
    # Make predictions on the validation set
    y_val_pred = rf_model.predict(X_val_filled)  # Assuming X_val_filled is pre-defined
    
    # Calculate the Root Mean Squared Error (RMSE)
    rmse = sqrt(mean_squared_error(y_val, y_val_pred))  # Assuming y_val is pre-defined
    
    # Calculate the Mean Absolute Percentage Error (MAPE)
    mape = mean_absolute_percentage_error(y_val, y_val_pred) * 100  # Assuming y_val is pre-defined
    
    # Prepare the results dictionary
    result_dictionary = {
        'n_estimators': n_estimators,
        'max_features': max_features,
        'max_depth': max_depth,
        'rmse': rmse,
        'mape': mape
    }
    
    # Save the results dictionary to a file in JSON format
    with open(file_path, 'w') as file:
        json.dump(result_dictionary, file)

### Running the multiprocessing program

In [19]:
# Start timing the execution
start_time = time.time()  

# Initialize lists to keep track of processes and file paths for results
processes = []
file_paths = []

# Assuming n_estimators_range, max_features_range, and max_depth_range are predefined lists
for n_estimators in n_estimators_range:
    for max_features in max_features_range:
        for max_depth in max_depth_range:
            # Construct a file path for each combination of parameters
            file_path = f'results_{n_estimators}_{max_features}_{max_depth}.json'
            # Create a process targeting the evaluate_model function with the current combination of parameters
            process = Process(target=evaluate_model,
                              args=(n_estimators,
                                    max_features,
                                    max_depth,
                                    file_path))
            processes.append(process)  # Add the process to the list
            file_paths.append(file_path)  # Keep track of where the results will be stored
            process.start()  # Start the process

# Wait for all processes to complete
for process in tqdm(processes, desc="Finishing processes: "):
    process.join()

# Gather the results from each file
results = []
for file_path in tqdm(file_paths, desc="Analyzing results: "):
    if os.path.exists(file_path):
        with open(file_path, 'r') as f:
            results.append(json.load(f))  # Load the results and add them to the list
        os.remove(file_path)  # Clean up by removing the file after reading
    else:
        print(f"Warning: Missing results file {file_path}")

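# Find the best parameters based on RMSE
best_params = None
best_rmse = float('inf')
best_mape = float('inf')
for result in results:
    if result['rmse'] < best_rmse:
        best_rmse = result['rmse']
        best_mape = result['mape']
        best_params = {
            'n_estimators': result['n_estimators'],
            'max_features': result['max_features'],
            'max_depth': result['max_depth']
        }

# Output the best parameters and performance metrics
print(f"The best parameters {best_params} for RMSE = {best_rmse}, MAPE = {best_mape}%")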

end_time = time.time()  # End timing the execution
processes_time = end_time - start_time
print(f"The processes parallel execution time is {processes_time}")  # Print the total parallel execution time

Finishing processes: 100%|████████████████████████████████████████████████████████████| 126/126 [00:15<00:00,  7.88it/s]
Analyzing results: 100%|█████████████████████████████████████████████████████████████| 126/126 [00:00<00:00, 3033.52it/s]

The best parameters {'n_estimators': 100, 'max_features': None, 'max_depth': None} for RMSE = 26057.941851126383, MAPE = 9.868196740754167%
The processes parallel execution time is 26.382996559143066





## Computing the performance metrics

### Speedup
In parallel computing, speedup quantifies how much faster a parallel algorithm solves a particular problem than its serial counterpart. It is the basic measure of the benefit gained from parallel processing.

Mathematically, speedup (S) is defined as:

$$
S = \frac{T_{serial}}{T_{parallel}}
$$

Where:
- $T_{serial}$ is the execution time of the best-known serial algorithm for solving a problem.
- $T_{parallel}$ is the execution time of the parallel algorithm solving the same problem with $np$ processors or cores.

#### Interpretation

- $S > 1$: The parallel algorithm is faster than the serial algorithm. The larger the value of $S$, the better the performance improvement.
- $S = 1$: The parallel algorithm offers no speed improvement over the serial algorithm. This situation may occur due to overheads in parallelization that negate the benefits of running computations in parallel.
- $S < 1$: The parallel algorithm is slower than the serial algorithm, which indicates a poor parallelization strategy where the overheads dominate the computational benefits.

#### Ideal and Realistic Speedup

- **Linear Speedup**: Ideally, if a problem is perfectly parallelizable, speedup would be equal to $np$ (the number of processors), meaning the task would run $np$ times faster than the serial version. This is known as linear speedup.
- **Superlinear Speedup**: In some rare cases, speedup can be greater than $np$, which is known as superlinear speedup. This can happen due to several factors, such as more efficient use of cache memory in the parallel algorithm.
- **Sublinear Speedup**: More commonly, the speedup is less than $np$ due to overheads such as communication among processors, synchronization, and data distribution. This is known as sublinear speedup.
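
The three regimes above can be told apart directly from measured run times. The sketch below is a minimal illustration rather than part of the original lab: the helper name `classify_speedup`, the 1% tolerance used to decide what counts as "linear", and the example timings are all assumptions chosen for illustration.

```python
import math


def classify_speedup(t_serial, t_parallel, n_procs, rel_tol=1e-2):
    """Return (speedup, regime) for measured serial and parallel run times.

    rel_tol is an assumed tolerance: a speedup within 1% of n_procs
    is reported as "linear".
    """
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")

    speedup = t_serial / t_parallel
    if math.isclose(speedup, n_procs, rel_tol=rel_tol):
        regime = "linear"
    elif speedup > n_procs:
        regime = "superlinear"
    elif speedup > 1:
        regime = "sublinear"
    else:
        regime = "no speedup"  # S <= 1: overheads cancel or outweigh the gains
    return speedup, regime


# Hypothetical timings (in seconds) on 4 processors
for t_serial, t_parallel in [(60.0, 20.0), (40.0, 10.0), (40.0, 8.0), (10.0, 20.0)]:
    s, regime = classify_speedup(t_serial, t_parallel, n_procs=4)
    print(f"T_serial={t_serial}, T_parallel={t_parallel}: S={s:.2f} ({regime})")
```

A tolerance is used because measured times are noisy, so an exact comparison with $np$ would almost never report a linear speedup.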

In [20]:
import multiprocessing
# Get the number of CPUs available
number_of_cpus = multiprocessing.cpu_count()
print(f"The number of the available CPUs is {number_of_cpus}")

# Computing the speedup:
speedup_threading = sequential_time / threading_time
print(f"The speedup in the threading case is {speedup_threading}.")

The number of the available CPUs is 4
The speedup in the threading case is 1.0598803171009337.


In [21]:
speedup_processes = sequential_time / processes_time
print(f"The speedup in the multiprocessing case is {speedup_processes}.")

The speedup in the multiprocessing case is 2.236310336499994.


### Efficiency

Efficiency in the context of parallel computing is a measure that evaluates how well the computational resources (processors or cores) are utilized when executing a parallel algorithm. It's a way to quantify the effectiveness of parallelization by considering both the speedup achieved and the number of processors used.


Efficiency (E) is defined as the speedup (S) divided by the number of processors ($np$) used in the parallel algorithm:

$$E = \frac{S}{np} = \frac{T_{serial}}{T_{parallel} \times np}$$

Where:
- $S$ is the speedup, which is the ratio of the execution time of the best serial algorithm ($T_{serial}$) to the execution time of the parallel algorithm ($T_{parallel}$).
- $np$ is the number of processors (or cores) used in the parallel execution.

#### Interpretation
- **$E = 1$**: This is the ideal scenario where the parallel algorithm achieves perfect linear speedup, indicating that all processors are being used efficiently without any waste of computational resources.
- **$0 < E < 1$**: This is the most common scenario, indicating sublinear speedup. It means that there are diminishing returns on adding more processors, likely due to overheads such as communication among processors, load imbalance, or idle time.
- **$E > 1$**: Although rare, superlinear efficiency can occur, indicating that the parallel algorithm with $np$ processors is more than $np$ times faster than the serial algorithm.
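
As a quick check of the definition, the sketch below applies $E = S / np$ to the two speedups measured earlier in this notebook on 4 CPUs. The helper name `parallel_efficiency` is an assumption introduced only for this illustration.

```python
def parallel_efficiency(speedup, n_procs):
    """Efficiency E = S / np: the fraction of ideal linear speedup achieved."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup / n_procs


# Speedups reported above for this notebook's run on 4 CPUs
measured_speedups = {
    "threading": 1.0598803171009337,
    "multiprocessing": 2.236310336499994,
}

for label, s in measured_speedups.items():
    e = parallel_efficiency(s, n_procs=4)
    print(f"{label}: E = {e:.3f} ({e:.1%} of ideal linear speedup)")
```

Both values fall in the $0 < E < 1$ range, with multiprocessing using the available cores roughly twice as effectively as threading in this run.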


In [22]:
# Computing the Efficiency for threading
efficiency_threading = speedup_threading / number_of_cpus
print(f"The efficiency in the threading case is {efficiency_threading}.")

The efficiency in the threading case is 0.2649700792752334.


In [23]:
# Computing the Efficiency for multiprocessing
efficiency_processes = speedup_processes / number_of_cpus
print(f"The efficiency in the multiprocessing case is {efficiency_processes}.")

The efficiency in the multiprocessing case is 0.5590775841249985.


### Amdahl's Law

Amdahl's Law is used to find the maximum improvement to an overall system when only part of the system is improved. In its limiting form, where the improved part is assumed to run infinitely fast, it is formulated as:

$$
S_{Amdahl} = \frac{1}{(1 - P)}
$$

where:
- $P$ is the proportion of the execution time that the improved system affects (i.e., the parallelizable portion).

#### Interpretations of Amdahl's Law

Amdahl's Law offers critical insights into the limits of parallel computing and system optimization. Here are the key interpretations:

1. **Limited Speedup**: Amdahl's Law highlights that the speedup of a system due to parallelization is limited by the sequential portion of the task. Even if the parallelizable portion is executed infinitely fast, the speedup will have an upper limit determined by the sequential part.

2. **Diminishing Returns**: As more processors are added, the benefit of adding additional processors decreases if the sequential portion of the task remains constant. This is because the speedup gains are primarily achieved in the parallelizable portion, which becomes a smaller fraction of the total execution time as it gets optimized.

3. **Importance of Parallelizable Code**: The law emphasizes the importance of minimizing the sequential portion of a task to achieve significant speedup through parallelization. The more a task can be parallelized, the greater the potential speedup from adding more processors.

4. **Bottleneck of Sequential Execution**: The sequential portion of a task acts as a bottleneck for overall system speedup. Reducing the sequential part through algorithm optimization or problem decomposition is crucial for enhancing performance in parallel computing environments.
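
The limited speedup and diminishing returns described above can be seen numerically. The sketch below uses the processor-dependent form of the law, $S = \frac{1}{(1 - P) + P/np}$, which is introduced in the next section. The parallel fraction `P = 0.95` and the helper name `amdahl_speedup` are assumptions chosen for illustration, not values measured in this lab.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Amdahl's Law: predicted speedup for a fixed workload on n_procs processors."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)


P = 0.95                    # hypothetical parallel fraction
ceiling = 1.0 / (1.0 - P)   # upper limit set by the sequential 5%

for n in (1, 2, 4, 8, 16, 64, 1024):
    s = amdahl_speedup(P, n)
    # s / n is the efficiency: it drops as processors are added
    print(f"np={n:>4}  S={s:6.2f}  E={s / n:5.1%}  ({s / ceiling:5.1%} of the ceiling)")
```

Even with 1024 processors the predicted speedup stays below $1 / (1 - P)$, and the efficiency falls steadily as $np$ grows, because the sequential 5% takes up an ever larger share of the remaining run time.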


<span style="font-size: larger;"><span style="color:red">The following code runs for a single iteration, showcasing the part that **cannot** be parallelized.</span></span>

In [None]:
# Computing the Amdahl speedup for threading
# The sequential execution time
print(f"The sequential time is: {sequential_time}")
print("Let's run only the sequential part that cannot be parallelized:")

In [31]:
import time
from math import sqrt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from tqdm import tqdm

# Starting the timer
start_time = time.time()

# Initialize variables to store the best model and its RMSE and parameters
best_rmse = float('inf')
best_mape = float('inf')
best_model = None
best_parameters = {}

# Loop over only the first value of each parameter range (a single iteration)
for n_estimators in tqdm(n_estimators_range[:1]):
    for max_features in max_features_range[:1]:
        for max_depth in max_depth_range[:1]:
            # Create and train the Random Forest model
            rf_model = RandomForestRegressor(
                n_estimators=n_estimators,
                max_features=max_features,
                max_depth=max_depth,
                random_state=42
            )
            rf_model.fit(X_train_filled, y_train)
            
            # Make predictions and compute RMSE
            y_val_pred = rf_model.predict(X_val_filled)
            rmse = sqrt(mean_squared_error(y_val, y_val_pred))
            # Compute MAPE
            mape = mean_absolute_percentage_error(y_val, y_val_pred) * 100
            print(f"The parameters: {n_estimators}, {max_features}, {max_depth}. RMSE: {rmse}, MAPE: {mape}%")
            # If the model is better than the current best, update the best model and its parameters
            if rmse < best_rmse:
                best_rmse = rmse
                best_mape = mape
                best_model = rf_model
                best_parameters = {
                    'n_estimators': n_estimators,
                    'max_features': max_features,
                    'max_depth': max_depth
                }
print(f"The best parameters {best_parameters} for RMSE = {best_rmse}, MAPE: {best_mape}%")
end_time = time.time()
non_parallel_time = end_time - start_time
print(f"The sequential execution of non parallel time is {non_parallel_time}")

100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 58.84it/s]

The parameters: 10, sqrt, 1. RMSE: 62936.129262244904, MAPE: 26.860734529882123%
The best parameters {'n_estimators': 10, 'max_features': 'sqrt', 'max_depth': 1} for RMSE = 62936.129262244904, MAPE: 26.860734529882123%
The sequential execution of non parallel time is 0.01868748664855957





In [33]:
# Computing the Amdahl speedup
parallel_portion = 1 - (non_parallel_time / sequential_time)
S_amdhal = 1 / (1 - parallel_portion)
print(f"The speedup according to Amdahl should be {S_amdhal}.")

The speedup according to Amdahl should be 3157.2232811519043.


### Amdahl's Law: the version dependent on the number of processors

There is a second enhanced version of the law that measures the speedup according to the number of processors:

$$
S_{Amdahl} = \frac{1}{(1 - P) + \frac{P}{np}}
$$

where:
- $P$ is the proportion of the execution time that the improved system affects (i.e., the parallelizable portion).
- $np$ is the number of processors, i.e., the factor by which the parallelizable portion of the task is sped up by the improved system resources.
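
The two forms of the law are connected by a limit. As a short sketch, assume the serial run takes $T_{serial}$ and that the parallelizable fraction $P$ is divided evenly over $np$ processors with no overhead (an idealization). The parallel run time is then:

$$
T_{parallel} = (1 - P)\,T_{serial} + \frac{P\,T_{serial}}{np}
$$

Dividing $T_{serial}$ by this expression gives the speedup:

$$
S_{Amdahl} = \frac{T_{serial}}{T_{parallel}} = \frac{1}{(1 - P) + \frac{P}{np}}
$$

Letting the number of processors grow without bound removes the $\frac{P}{np}$ term and recovers the first form of the law:

$$
\lim_{np \to \infty} S_{Amdahl} = \frac{1}{1 - P}
$$

For example, with $P = 0.9$ the speedup can never exceed $10$, however many processors are used.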

#### Let's recompute the speedup with the new formula:

In [34]:
S_amdhal = 1 / ((1 - parallel_portion) + (parallel_portion / number_of_cpus))
print(f"The speedup according to Amdahl should be {S_amdhal}.")

The speedup according to Amdahl should be 3.9962027999503804.


<span style="font-size: larger;"><span style="color:green">**See how this result is less theoretical and more realistic.**</span></span>

### Gustafson's Law

Gustafson's Law provides a more optimistic view, suggesting that the speedup is linearly proportional to the number of processors when the workload scales with the number of processors. It is given by:

$$
\text{Speedup} = (1 - P) + P \cdot np = \alpha + P \cdot np
$$

where:
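- $P$ and $np$ play the same roles as in Amdahl's Law. Strictly speaking, $P$ here is the parallel fraction of the run time as measured on the parallel system rather than on the serial run.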
- $\alpha = 1 - P$: the serial part of the program.

To compute these laws' predictions for speedup, we need:
- The value of $P$ (the parallelizable portion of the task).
- The number of processors $np$.

#### Interpretations of Gustafson's Law

Gustafson's Law offers an optimistic perspective on the scalability of parallel computing, highlighting several key points:

1. **Scalable Problem Size**: Unlike Amdahl's Law, which focuses on a fixed workload, Gustafson's Law assumes that the total workload can increase with the number of processors. This reflects more realistic scenarios where larger computational resources enable tackling more significant problems.

2. **Linear Speedup**: Gustafson's Law suggests that the speedup of an algorithm can be nearly linear with respect to the number of processors, provided that the workload scales accordingly. This contrasts with Amdahl's Law, which predicts a theoretical upper limit to speedup due to the sequential portion of a task.

3. **Reduced Impact of Sequential Components**: While Amdahl's Law emphasizes the limitation imposed by the sequential portion of a task, Gustafson's Law implies that increasing the problem size makes the fixed sequential time a smaller fraction of the total execution time, thus reducing its impact on the overall speedup.

4. **Importance of Parallelizable Workload**: The law highlights the importance of designing algorithms and workloads that can be effectively parallelized. As the parallel portion of the workload increases, the potential for speedup grows, making efficient parallel algorithm design crucial for leveraging modern multi-processor systems.
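
To make the contrast with Amdahl's Law concrete, the sketch below evaluates both formulas side by side for the same parallel fraction. The value `P = 0.9` and the helper names `amdahl_speedup` and `gustafson_speedup` are assumptions chosen for illustration, not measurements from this lab.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Fixed workload: S = 1 / ((1 - P) + P / np)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)


def gustafson_speedup(parallel_fraction, n_procs):
    """Workload scaled with the processor count: S = (1 - P) + P * np."""
    return (1.0 - parallel_fraction) + parallel_fraction * n_procs


P = 0.9  # hypothetical parallel fraction

print(f"{'np':>4}  {'Amdahl':>8}  {'Gustafson':>10}")
for n in (1, 4, 16, 64, 256):
    print(f"{n:>4}  {amdahl_speedup(P, n):8.2f}  {gustafson_speedup(P, n):10.2f}")
```

Under Amdahl's fixed-workload assumption the speedup saturates below $1 / (1 - P) = 10$, while the Gustafson prediction keeps growing linearly with $np$ because the problem size is assumed to grow with the machine.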

In [38]:
S_gustafson = (1 - parallel_portion) + (parallel_portion * number_of_cpus)
print(f"The speedup according to Gustafson should be {S_gustafson}.")

The speedup according to Gustafson should be 3.9990497979607875.
