# Feature Engineering: Extraction and Selection in Bi-Dimensional data

## Overview

This document outlines an exploration into the phase of data preparation and feature engineering within the context of *predictive modeling*. The primary aim of this project is to showcase feature engineering techniques to enhance the performance of machine learning models. By extracting, transforming, and selecting features from raw data with high-dimentionality, we aim to uncover more meaningful patterns and insights that can significantly improve model accuracy and interpretability.

## Project Objectives

The note unfolds through several key stages:

1. **Data Reading and Preparation**: We begin by importing and consolidating our raw dataset from different sources, ensuring that our data is organized and accessible for analysis.

2. **Exploratory Data Analysis (EDA)**: Through descriptive statistics and visualizations, we gain a foundational understanding of the data's characteristics and distributions, setting the stage for more informed feature engineering decisions.

3. **Feature Extraction**: This stage involves the creation of new features from the existing data, utilizing domain knowledge and data analysis insights to craft variables that are potentially more predictive of the outcome.

4. **Feature Transformation and Selection**: We apply various techniques to modify and select the most relevant features, aiming to enhance model performance while reducing dimensionality and complexity.

5. **Modeling and Evaluation**: With our refined feature set, we train several machine learning models, including RandomForestRegressor, SVR, and Lasso, to predict outcomes with higher accuracy. This phase also involves rigorous error analysis to assess and improve model performance.

6. **Feature Importance and Optimization**: By analyzing feature importance and employing additional selection techniques like ReliefF, we further refine our feature set, focusing on variables that offer the most value to our predictive models.

7. **Conclusion and Future Work**: The project concludes with a synthesis of our findings, insights into the impact of feature engineering on model performance, and considerations for future explorations in this domain.

## Imports

In [10]:
import numpy as np
import os
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from tqdm import tqdm

## Reading the data

In [11]:
def read_and_combine(folder: str = "training"):
    """
    Reads and combines data from a specified folder into a pandas DataFrame.
    
    This function is designed to streamline the process of importing and consolidating
    datasets for analysis. It reads data files from the specified directory, 
    combines them into a single DataFrame, and returns the combined data for further processing.
    
    Parameters:
    - folder (str, optional): The name of the folder from which to read the data files. 
      This can be set to "training", "validation", or "test" to specify the dataset 
      to be read. The default value is "training".
    
    Returns:
    - pandas.DataFrame: A DataFrame containing the combined data from all files 
      within the specified folder.
    
    Example:
    >>> training_data = read_and_combine("training")
    >>> print(training_data.head())
    
    Note:
    - Ensure that the specified folder contains data files in a format that can be 
      read directly into a pandas DataFrame (e.g., CSV files).
    - The function assumes all files in the folder are relevant and should be combined.
    """
    # Path to the folder containing CSV files
    folder_path = "data/" + folder
    # List to store DataFrames from each CSV file
    dataFrames = []
    # Loop through all files in the folder
    for filename in tqdm(os.listdir(folder_path)):
        if filename.endswith('.csv'):
            file_path = os.path.join(folder_path, filename)
            # Read each CSV file into a DataFrame and append to the list
            dataframe = pd.read_csv(file_path)
            dataFrames.append(dataframe)
    # Combine all DataFrames into one
    combined_dataframe = pd.concat(dataFrames, ignore_index=True)
    # Save the combined DataFrame to a new CSV file
    combined_dataframe.to_csv('combined_output.csv', index=False)
    # Display the combined DataFrame
    return combined_dataframe

In [12]:
training_data = read_and_combine()
training_data.shape

100%|██████████| 185/185 [00:00<00:00, 206.03it/s]
  combined_dataframe = pd.concat(dataFrames, ignore_index=True)


(672744, 25)

In [13]:
validation_data = read_and_combine("validation")
validation_data.shape

100%|██████████| 185/185 [00:00<00:00, 554.04it/s]
  combined_dataframe = pd.concat(dataFrames, ignore_index=True)


(144148, 25)

In [14]:
test_data = read_and_combine("test")
test_data.shape

100%|██████████| 185/185 [00:00<00:00, 598.48it/s]
  combined_dataframe = pd.concat(dataFrames, ignore_index=True)


(156262, 25)

## Let's the check the type of data we have

In [15]:
training_data.dtypes

MACHINE_ID                       object
MACHINE_DATA                     object
TIMESTAMP                       float64
WAFER_ID                         object
STAGE                            object
CHAMBER                         float64
USAGE_OF_BACKING_FILM           float64
USAGE_OF_DRESSER                float64
USAGE_OF_POLISHING_TABLE        float64
USAGE_OF_DRESSER_TABLE          float64
PRESSURIZED_CHAMBER_PRESSURE    float64
MAIN_OUTER_AIR_BAG_PRESSURE     float64
CENTER_AIR_BAG_PRESSURE         float64
RETAINER_RING_PRESSURE          float64
RIPPLE_AIR_BAG_PRESSURE         float64
USAGE_OF_MEMBRANE               float64
USAGE_OF_PRESSURIZED_SHEET      float64
SLURRY_FLOW_LINE_A              float64
SLURRY_FLOW_LINE_B              float64
SLURRY_FLOW_LINE_C              float64
WAFER_ROTATION                  float64
STAGE_ROTATION                  float64
HEAD_ROTATION                   float64
DRESSING_WATER_STATUS           float64
EDGE_AIR_BAG_PRESSURE           float64


## Dataframe description

In [16]:
training_data.describe()

Unnamed: 0,TIMESTAMP,CHAMBER,USAGE_OF_BACKING_FILM,USAGE_OF_DRESSER,USAGE_OF_POLISHING_TABLE,USAGE_OF_DRESSER_TABLE,PRESSURIZED_CHAMBER_PRESSURE,MAIN_OUTER_AIR_BAG_PRESSURE,CENTER_AIR_BAG_PRESSURE,RETAINER_RING_PRESSURE,...,USAGE_OF_MEMBRANE,USAGE_OF_PRESSURIZED_SHEET,SLURRY_FLOW_LINE_A,SLURRY_FLOW_LINE_B,SLURRY_FLOW_LINE_C,WAFER_ROTATION,STAGE_ROTATION,HEAD_ROTATION,DRESSING_WATER_STATUS,EDGE_AIR_BAG_PRESSURE
count,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,...,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0,672744.0
mean,484418600.0,4.223673,4968.532485,396.444964,171.983843,3496.348712,49.973427,155.327976,40.147023,1218.777316,...,58.915409,1490.559854,4.245952,0.725417,249.354458,12.802433,52.43756,159.792734,0.424763,28.5317
std,1639134.0,1.333534,2888.628864,219.524524,94.623563,479.742809,39.241073,133.191797,34.240954,1499.216737,...,34.252516,866.588654,6.683546,0.420575,214.034647,16.325427,91.87822,8.889108,0.494307,24.346485
min,481634400.0,1.0,19.166667,5.185185,0.0,2664.75,0.0,0.0,0.0,0.0,...,0.227273,5.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,482773600.0,4.0,2425.0,205.185185,88.888889,3041.0,0.0,0.0,0.0,0.0,...,28.754941,727.5,2.222222,0.909091,0.0,0.0,0.0,156.8,0.0,0.0
50%,484653400.0,4.0,5036.666667,395.925926,172.592593,3544.75,72.857143,252.0,61.25,1446.9,...,59.72332,1511.0,2.222222,0.909091,411.6,0.0,0.0,160.0,0.0,43.939394
75%,485799100.0,5.0,7322.5,590.37037,254.074074,3912.0,77.142857,268.8,66.25,1454.7,...,86.828063,2196.75,2.222222,0.909091,439.6,34.651163,65.526316,160.0,1.0,48.484848
max,487268200.0,6.0,10532.5,771.851852,357.037037,4305.5,189.047619,499.2,139.375,10662.6,...,124.891304,3159.75,42.638889,12.5,1083.6,34.883721,263.552632,192.0,1.0,141.515152


## Feature Extraction
### Using a function
- Writing the function

In [17]:
import numpy as np
import pandas as pd
from tqdm import tqdm

def extract_features(data: pd.DataFrame,
                    id_column: str = "WAFER_ID",
                    non_extracted_columns: list = ["TIMESTAMP", "WAFER_ID", "CHAMBER"]):
    """
    Extract statistical features for each unique ID and Stage combination from the provided DataFrame.
    
    Parameters:
    - data (pd.DataFrame): The DataFrame containing the data to process.
    - id_column (str): The name of the column in 'data' that contains the unique ID for each entity.
    - non_extracted_columns (list): A list of column names to exclude from feature extraction.
    
    Returns:
    - pd.DataFrame: A new DataFrame where each row contains the extracted features for each unique ID and Stage combination.
    """
    
    data_rows = []
    for (wafer, stage), group in tqdm(data.groupby([id_column, "STAGE"])):
        wafer_stage_data = group.copy()
        # Iterate through each numerical column and calculate the features using numpy
        features_np = {}
        for column in wafer_stage_data.select_dtypes(include='number').columns:
            if column in non_extracted_columns:
                continue
            col_data = wafer_stage_data[column].values  # Convert the column to a numpy array
            features_np.update({
                f'{column}_Mean': np.mean(col_data),
                f'{column}_Median': np.median(col_data),
                f'{column}_StdDev': np.std(col_data, ddof=1),
                f'{column}_Variance': np.var(col_data, ddof=1),
                f'{column}_Minimum': np.min(col_data),
                f'{column}_Maximum': np.max(col_data),
                f'{column}_Range': np.ptp(col_data),
                f'{column}_Skewness': pd.Series(col_data).skew(),
                f'{column}_Kurtosis': pd.Series(col_data).kurt(),
                f'{column}_25thPercentile': np.percentile(col_data, 25),
                f'{column}_50thPercentile': np.percentile(col_data, 50),
                f'{column}_75thPercentile': np.percentile(col_data, 75)
            })
        # Convert the features dictionary to a DataFrame
        feature_df = pd.DataFrame([features_np])
        feature_df.insert(0, id_column, wafer)
        feature_df.insert(1, "STAGE", stage)
        if "CHAMBER" in data.columns:
            feature_df.insert(2, "CHAMBER", np.unique(wafer_stage_data["CHAMBER"])[0])

        data_rows.append(feature_df)
    
    extracted_data = pd.concat(data_rows, ignore_index=True)
    return extracted_data


- Extracting the features

In [18]:
training_set = extract_features(training_data)
training_set.dtypes

100%|██████████| 1981/1981 [00:10<00:00, 192.36it/s]


WAFER_ID                                  int64
STAGE                                    object
CHAMBER                                 float64
USAGE_OF_BACKING_FILM_Mean              float64
USAGE_OF_BACKING_FILM_Median            float64
                                         ...   
EDGE_AIR_BAG_PRESSURE_Skewness          float64
EDGE_AIR_BAG_PRESSURE_Kurtosis          float64
EDGE_AIR_BAG_PRESSURE_25thPercentile    float64
EDGE_AIR_BAG_PRESSURE_50thPercentile    float64
EDGE_AIR_BAG_PRESSURE_75thPercentile    float64
Length: 231, dtype: object

In [19]:
validation_set = extract_features(validation_data)
validation_set.dtypes

100%|██████████| 424/424 [00:02<00:00, 191.66it/s]


WAFER_ID                                  int64
STAGE                                    object
CHAMBER                                 float64
USAGE_OF_BACKING_FILM_Mean              float64
USAGE_OF_BACKING_FILM_Median            float64
                                         ...   
EDGE_AIR_BAG_PRESSURE_Skewness          float64
EDGE_AIR_BAG_PRESSURE_Kurtosis          float64
EDGE_AIR_BAG_PRESSURE_25thPercentile    float64
EDGE_AIR_BAG_PRESSURE_50thPercentile    float64
EDGE_AIR_BAG_PRESSURE_75thPercentile    float64
Length: 231, dtype: object

In [20]:
test_set = extract_features(test_data)
test_data.dtypes

100%|██████████| 424/424 [00:02<00:00, 194.14it/s]


MACHINE_ID                       object
MACHINE_DATA                     object
TIMESTAMP                       float64
WAFER_ID                         object
STAGE                            object
CHAMBER                         float64
USAGE_OF_BACKING_FILM           float64
USAGE_OF_DRESSER                float64
USAGE_OF_POLISHING_TABLE        float64
USAGE_OF_DRESSER_TABLE          float64
PRESSURIZED_CHAMBER_PRESSURE    float64
MAIN_OUTER_AIR_BAG_PRESSURE     float64
CENTER_AIR_BAG_PRESSURE         float64
RETAINER_RING_PRESSURE          float64
RIPPLE_AIR_BAG_PRESSURE         float64
USAGE_OF_MEMBRANE               float64
USAGE_OF_PRESSURIZED_SHEET      float64
SLURRY_FLOW_LINE_A              float64
SLURRY_FLOW_LINE_B              float64
SLURRY_FLOW_LINE_C              float64
WAFER_ROTATION                  float64
STAGE_ROTATION                  float64
HEAD_ROTATION                   float64
DRESSING_WATER_STATUS           float64
EDGE_AIR_BAG_PRESSURE           float64


In [21]:
def add_output_column(data: pd.DataFrame(),
                      data_name: str = "training"):
    output_data = pd.read_csv("data/CMP-" + data_name + "-removalrate.csv")
    data = pd.merge(data, output_data, on=['WAFER_ID', 'STAGE'])
    return data   

In [22]:
training_set = add_output_column(training_set)

In [23]:
training_set['STAGE'] = training_set['STAGE'].replace({'A': 0, 'B': 1})
training_set.head()

  training_set['STAGE'] = training_set['STAGE'].replace({'A': 0, 'B': 1})


Unnamed: 0,WAFER_ID,STAGE,CHAMBER,USAGE_OF_BACKING_FILM_Mean,USAGE_OF_BACKING_FILM_Median,USAGE_OF_BACKING_FILM_StdDev,USAGE_OF_BACKING_FILM_Variance,USAGE_OF_BACKING_FILM_Minimum,USAGE_OF_BACKING_FILM_Maximum,USAGE_OF_BACKING_FILM_Range,...,EDGE_AIR_BAG_PRESSURE_Variance,EDGE_AIR_BAG_PRESSURE_Minimum,EDGE_AIR_BAG_PRESSURE_Maximum,EDGE_AIR_BAG_PRESSURE_Range,EDGE_AIR_BAG_PRESSURE_Skewness,EDGE_AIR_BAG_PRESSURE_Kurtosis,EDGE_AIR_BAG_PRESSURE_25thPercentile,EDGE_AIR_BAG_PRESSURE_50thPercentile,EDGE_AIR_BAG_PRESSURE_75thPercentile,AVG_REMOVAL_RATE
0,-4230160598,0,4.0,890.069846,890.833333,3.709068,13.757185,884.166667,896.666667,12.5,...,422.571307,0.0,57.878788,57.878788,-1.254628,-0.261978,48.484848,48.484848,48.484848,68.8818
1,-4230160594,1,4.0,1291.998698,1293.333333,4.324576,18.701959,1285.833333,1298.333333,12.5,...,834.453487,0.0,106.363636,106.363636,-0.446917,-0.612414,43.939394,44.242424,70.0,70.0533
2,-4230160436,1,4.0,3272.829619,3273.333333,3.423431,11.719881,3266.666667,3277.5,10.833333,...,1867.577448,0.0,141.515152,141.515152,1.238519,0.891666,0.0,43.939394,44.242424,54.3072
3,-4230160428,0,4.0,5922.780214,5924.166667,2.480098,6.150885,5918.333333,5925.0,6.666667,...,587.820962,0.0,57.878788,57.878788,0.395049,-1.796705,0.0,0.0,48.484848,75.34995
4,-4230160424,0,4.0,4868.350291,4870.0,2.159261,4.66241,4864.166667,4870.0,5.833333,...,598.141525,0.0,57.878788,57.878788,0.430882,-1.787898,0.0,0.0,48.787879,78.33015


In [24]:
test_set = add_output_column(test_set,data_name="test")
test_set['STAGE'] = test_set['STAGE'].replace({'A': 0, 'B': 1})
test_set.head()

  test_set['STAGE'] = test_set['STAGE'].replace({'A': 0, 'B': 1})


Unnamed: 0,WAFER_ID,STAGE,CHAMBER,USAGE_OF_BACKING_FILM_Mean,USAGE_OF_BACKING_FILM_Median,USAGE_OF_BACKING_FILM_StdDev,USAGE_OF_BACKING_FILM_Variance,USAGE_OF_BACKING_FILM_Minimum,USAGE_OF_BACKING_FILM_Maximum,USAGE_OF_BACKING_FILM_Range,...,EDGE_AIR_BAG_PRESSURE_Variance,EDGE_AIR_BAG_PRESSURE_Minimum,EDGE_AIR_BAG_PRESSURE_Maximum,EDGE_AIR_BAG_PRESSURE_Range,EDGE_AIR_BAG_PRESSURE_Skewness,EDGE_AIR_BAG_PRESSURE_Kurtosis,EDGE_AIR_BAG_PRESSURE_25thPercentile,EDGE_AIR_BAG_PRESSURE_50thPercentile,EDGE_AIR_BAG_PRESSURE_75thPercentile,AVG_REMOVAL_RATE
0,-4226160404,0,4.0,10060.889356,10061.666667,3.866124,14.946915,10055.0,10067.5,12.5,...,396.731174,0.0,57.575758,57.575758,-1.375061,0.079085,48.484848,48.484848,48.484848,60.44715
1,-4224160686,0,4.0,9614.945238,9615.0,3.74794,14.047056,9609.166667,9621.666667,12.5,...,362.786997,0.0,57.878788,57.878788,-1.532967,0.580493,48.484848,48.787879,48.787879,57.2523
2,-4224160678,1,4.0,9526.132959,9526.666667,3.834909,14.706529,9520.0,9533.333333,13.333333,...,337.359492,0.0,57.878788,57.878788,-1.507054,0.669936,43.939394,44.242424,48.787879,66.9813
3,-4224160592,0,4.0,9332.545761,9332.916667,3.690949,13.623101,9326.666667,9339.166667,12.5,...,354.062453,0.0,57.878788,57.878788,-1.393967,0.326083,43.939394,44.242424,48.787879,56.1786
4,-4222160444,1,4.0,3375.676944,3375.833333,4.013973,16.111976,3369.166667,3382.5,13.333333,...,269.665701,0.0,57.878788,57.878788,-1.74568,1.540839,43.939394,43.939394,43.939394,60.8757


In [25]:
training_set.to_csv('training_set.csv', index=False)
test_set.to_csv('test_set.csv', index=False)

In [26]:
training_set

Unnamed: 0,WAFER_ID,STAGE,CHAMBER,USAGE_OF_BACKING_FILM_Mean,USAGE_OF_BACKING_FILM_Median,USAGE_OF_BACKING_FILM_StdDev,USAGE_OF_BACKING_FILM_Variance,USAGE_OF_BACKING_FILM_Minimum,USAGE_OF_BACKING_FILM_Maximum,USAGE_OF_BACKING_FILM_Range,...,EDGE_AIR_BAG_PRESSURE_Variance,EDGE_AIR_BAG_PRESSURE_Minimum,EDGE_AIR_BAG_PRESSURE_Maximum,EDGE_AIR_BAG_PRESSURE_Range,EDGE_AIR_BAG_PRESSURE_Skewness,EDGE_AIR_BAG_PRESSURE_Kurtosis,EDGE_AIR_BAG_PRESSURE_25thPercentile,EDGE_AIR_BAG_PRESSURE_50thPercentile,EDGE_AIR_BAG_PRESSURE_75thPercentile,AVG_REMOVAL_RATE
0,-4230160598,0,4.0,890.069846,890.833333,3.709068,13.757185,884.166667,896.666667,12.500000,...,422.571307,0.0,57.878788,57.878788,-1.254628,-0.261978,48.484848,48.484848,48.484848,68.88180
1,-4230160594,1,4.0,1291.998698,1293.333333,4.324576,18.701959,1285.833333,1298.333333,12.500000,...,834.453487,0.0,106.363636,106.363636,-0.446917,-0.612414,43.939394,44.242424,70.000000,70.05330
2,-4230160436,1,4.0,3272.829619,3273.333333,3.423431,11.719881,3266.666667,3277.500000,10.833333,...,1867.577448,0.0,141.515152,141.515152,1.238519,0.891666,0.000000,43.939394,44.242424,54.30720
3,-4230160428,0,4.0,5922.780214,5924.166667,2.480098,6.150885,5918.333333,5925.000000,6.666667,...,587.820962,0.0,57.878788,57.878788,0.395049,-1.796705,0.000000,0.000000,48.484848,75.34995
4,-4230160424,0,4.0,4868.350291,4870.000000,2.159261,4.662410,4864.166667,4870.000000,5.833333,...,598.141525,0.0,57.878788,57.878788,0.430882,-1.787898,0.000000,0.000000,48.787879,78.33015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1976,4229773726,1,4.0,4366.857689,4367.500000,3.724011,13.868259,4360.833333,4373.333333,12.500000,...,381.257203,0.0,57.878788,57.878788,-1.258567,-0.097726,43.939394,43.939394,48.484848,76.78335
1977,4229773730,1,4.0,2492.143836,2492.500000,3.889432,15.127682,2485.833333,2499.166667,13.333333,...,334.507214,0.0,57.878788,57.878788,-1.321976,0.225655,43.939394,43.939394,43.939394,64.67670
1978,4229773746,0,4.0,2176.879179,2177.500000,3.760443,14.140932,2170.833333,2183.333333,12.500000,...,517.579888,0.0,60.909091,60.909091,-1.139250,-0.405617,48.484848,48.484848,60.606061,71.10945
1979,4229773746,1,4.0,2415.961187,2416.666667,3.844557,14.780617,2410.000000,2422.500000,12.500000,...,315.222355,0.0,57.878788,57.878788,-1.465598,0.658129,43.939394,43.939394,44.242424,65.95260


In [27]:
test_set

Unnamed: 0,WAFER_ID,STAGE,CHAMBER,USAGE_OF_BACKING_FILM_Mean,USAGE_OF_BACKING_FILM_Median,USAGE_OF_BACKING_FILM_StdDev,USAGE_OF_BACKING_FILM_Variance,USAGE_OF_BACKING_FILM_Minimum,USAGE_OF_BACKING_FILM_Maximum,USAGE_OF_BACKING_FILM_Range,...,EDGE_AIR_BAG_PRESSURE_Variance,EDGE_AIR_BAG_PRESSURE_Minimum,EDGE_AIR_BAG_PRESSURE_Maximum,EDGE_AIR_BAG_PRESSURE_Range,EDGE_AIR_BAG_PRESSURE_Skewness,EDGE_AIR_BAG_PRESSURE_Kurtosis,EDGE_AIR_BAG_PRESSURE_25thPercentile,EDGE_AIR_BAG_PRESSURE_50thPercentile,EDGE_AIR_BAG_PRESSURE_75thPercentile,AVG_REMOVAL_RATE
0,-4226160404,0,4.0,10060.889356,10061.666667,3.866124,14.946915,10055.000000,10067.500000,12.500000,...,396.731174,0.0,57.575758,57.575758,-1.375061,0.079085,48.484848,48.484848,48.484848,60.44715
1,-4224160686,0,4.0,9614.945238,9615.000000,3.747940,14.047056,9609.166667,9621.666667,12.500000,...,362.786997,0.0,57.878788,57.878788,-1.532967,0.580493,48.484848,48.787879,48.787879,57.25230
2,-4224160678,1,4.0,9526.132959,9526.666667,3.834909,14.706529,9520.000000,9533.333333,13.333333,...,337.359492,0.0,57.878788,57.878788,-1.507054,0.669936,43.939394,44.242424,48.787879,66.98130
3,-4224160592,0,4.0,9332.545761,9332.916667,3.690949,13.623101,9326.666667,9339.166667,12.500000,...,354.062453,0.0,57.878788,57.878788,-1.393967,0.326083,43.939394,44.242424,48.787879,56.17860
4,-4222160444,1,4.0,3375.676944,3375.833333,4.013973,16.111976,3369.166667,3382.500000,13.333333,...,269.665701,0.0,57.878788,57.878788,-1.745680,1.540839,43.939394,43.939394,43.939394,60.87570
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
419,4225773718,1,4.0,5485.459927,5485.833333,3.970619,15.765817,5479.166667,5492.500000,13.333333,...,295.897063,0.0,57.878788,57.878788,-1.585451,1.050168,43.939394,43.939394,44.242424,60.74265
420,4225773746,1,4.0,4513.427495,4514.166667,3.810169,14.517391,4507.500000,4520.000000,12.500000,...,304.413676,0.0,57.878788,57.878788,-1.523687,0.843243,43.939394,43.939394,44.242424,70.36635
421,4225773754,0,4.0,3535.279383,3535.833333,3.740019,13.987741,3529.166667,3541.666667,12.500000,...,384.479872,0.0,57.878788,57.878788,-1.465521,0.350295,48.484848,48.484848,48.787879,67.47135
422,4227773662,0,4.0,2905.799615,2907.500000,2.191255,4.801597,2901.666667,2907.500000,5.833333,...,594.035795,0.0,57.878788,57.878788,0.462397,-1.752804,0.000000,0.000000,48.484848,65.07495


In [28]:
training_set

Unnamed: 0,WAFER_ID,STAGE,CHAMBER,USAGE_OF_BACKING_FILM_Mean,USAGE_OF_BACKING_FILM_Median,USAGE_OF_BACKING_FILM_StdDev,USAGE_OF_BACKING_FILM_Variance,USAGE_OF_BACKING_FILM_Minimum,USAGE_OF_BACKING_FILM_Maximum,USAGE_OF_BACKING_FILM_Range,...,EDGE_AIR_BAG_PRESSURE_Variance,EDGE_AIR_BAG_PRESSURE_Minimum,EDGE_AIR_BAG_PRESSURE_Maximum,EDGE_AIR_BAG_PRESSURE_Range,EDGE_AIR_BAG_PRESSURE_Skewness,EDGE_AIR_BAG_PRESSURE_Kurtosis,EDGE_AIR_BAG_PRESSURE_25thPercentile,EDGE_AIR_BAG_PRESSURE_50thPercentile,EDGE_AIR_BAG_PRESSURE_75thPercentile,AVG_REMOVAL_RATE
0,-4230160598,0,4.0,890.069846,890.833333,3.709068,13.757185,884.166667,896.666667,12.500000,...,422.571307,0.0,57.878788,57.878788,-1.254628,-0.261978,48.484848,48.484848,48.484848,68.88180
1,-4230160594,1,4.0,1291.998698,1293.333333,4.324576,18.701959,1285.833333,1298.333333,12.500000,...,834.453487,0.0,106.363636,106.363636,-0.446917,-0.612414,43.939394,44.242424,70.000000,70.05330
2,-4230160436,1,4.0,3272.829619,3273.333333,3.423431,11.719881,3266.666667,3277.500000,10.833333,...,1867.577448,0.0,141.515152,141.515152,1.238519,0.891666,0.000000,43.939394,44.242424,54.30720
3,-4230160428,0,4.0,5922.780214,5924.166667,2.480098,6.150885,5918.333333,5925.000000,6.666667,...,587.820962,0.0,57.878788,57.878788,0.395049,-1.796705,0.000000,0.000000,48.484848,75.34995
4,-4230160424,0,4.0,4868.350291,4870.000000,2.159261,4.662410,4864.166667,4870.000000,5.833333,...,598.141525,0.0,57.878788,57.878788,0.430882,-1.787898,0.000000,0.000000,48.787879,78.33015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1976,4229773726,1,4.0,4366.857689,4367.500000,3.724011,13.868259,4360.833333,4373.333333,12.500000,...,381.257203,0.0,57.878788,57.878788,-1.258567,-0.097726,43.939394,43.939394,48.484848,76.78335
1977,4229773730,1,4.0,2492.143836,2492.500000,3.889432,15.127682,2485.833333,2499.166667,13.333333,...,334.507214,0.0,57.878788,57.878788,-1.321976,0.225655,43.939394,43.939394,43.939394,64.67670
1978,4229773746,0,4.0,2176.879179,2177.500000,3.760443,14.140932,2170.833333,2183.333333,12.500000,...,517.579888,0.0,60.909091,60.909091,-1.139250,-0.405617,48.484848,48.484848,60.606061,71.10945
1979,4229773746,1,4.0,2415.961187,2416.666667,3.844557,14.780617,2410.000000,2422.500000,12.500000,...,315.222355,0.0,57.878788,57.878788,-1.465598,0.658129,43.939394,43.939394,44.242424,65.95260


In [29]:
X = training_set.drop(['AVG_REMOVAL_RATE','CHAMBER'], axis=1)
y = training_set['AVG_REMOVAL_RATE']

correlation_threshold_low = 0.17
correlation_threshold_high = 0.81
correlations = X.corrwith(y)
selected_features = correlations[(abs(correlations) >= abs(correlation_threshold_low)) & (abs(correlations) <= abs(correlation_threshold_high))].index

print("Selected Features:")
print(len(selected_features))

Selected Features:
27


  c /= stddev[:, None]
  c /= stddev[None, :]


In [30]:
selected_features

Index(['PRESSURIZED_CHAMBER_PRESSURE_Maximum',
       'PRESSURIZED_CHAMBER_PRESSURE_Range',
       'PRESSURIZED_CHAMBER_PRESSURE_75thPercentile',
       'MAIN_OUTER_AIR_BAG_PRESSURE_75thPercentile',
       'CENTER_AIR_BAG_PRESSURE_75thPercentile',
       'RETAINER_RING_PRESSURE_StdDev', 'RETAINER_RING_PRESSURE_Maximum',
       'RETAINER_RING_PRESSURE_Range', 'RETAINER_RING_PRESSURE_Skewness',
       'RETAINER_RING_PRESSURE_75thPercentile',
       'RIPPLE_AIR_BAG_PRESSURE_75thPercentile', 'SLURRY_FLOW_LINE_A_Median',
       'SLURRY_FLOW_LINE_A_StdDev', 'SLURRY_FLOW_LINE_A_Maximum',
       'SLURRY_FLOW_LINE_A_Range', 'SLURRY_FLOW_LINE_A_50thPercentile',
       'SLURRY_FLOW_LINE_B_Median', 'SLURRY_FLOW_LINE_B_50thPercentile',
       'SLURRY_FLOW_LINE_B_75thPercentile', 'SLURRY_FLOW_LINE_C_Mean',
       'SLURRY_FLOW_LINE_C_StdDev', 'SLURRY_FLOW_LINE_C_75thPercentile',
       'WAFER_ROTATION_Mean', 'WAFER_ROTATION_Variance',
       'WAFER_ROTATION_75thPercentile', 'STAGE_ROTATION_75thPercen

In [31]:
ts2 = training_set[selected_features]
ts2['AVG_REMOVAL_RATE'] = training_set['AVG_REMOVAL_RATE']
ts2

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ts2['AVG_REMOVAL_RATE'] = training_set['AVG_REMOVAL_RATE']


Unnamed: 0,PRESSURIZED_CHAMBER_PRESSURE_Maximum,PRESSURIZED_CHAMBER_PRESSURE_Range,PRESSURIZED_CHAMBER_PRESSURE_75thPercentile,MAIN_OUTER_AIR_BAG_PRESSURE_75thPercentile,CENTER_AIR_BAG_PRESSURE_75thPercentile,RETAINER_RING_PRESSURE_StdDev,RETAINER_RING_PRESSURE_Maximum,RETAINER_RING_PRESSURE_Range,RETAINER_RING_PRESSURE_Skewness,RETAINER_RING_PRESSURE_75thPercentile,...,SLURRY_FLOW_LINE_B_75thPercentile,SLURRY_FLOW_LINE_C_Mean,SLURRY_FLOW_LINE_C_StdDev,SLURRY_FLOW_LINE_C_75thPercentile,WAFER_ROTATION_Mean,WAFER_ROTATION_Variance,WAFER_ROTATION_75thPercentile,STAGE_ROTATION_75thPercentile,EDGE_AIR_BAG_PRESSURE_75thPercentile,AVG_REMOVAL_RATE
0,150.000000,150.000000,78.571429,270.0,72.1875,1533.048952,7702.5,7702.5,2.678173,1453.725,...,0.909091,322.161850,185.944789,445.2,14.553031,281.198986,34.651163,114.901316,48.484848,68.88180
1,150.000000,150.000000,146.190476,498.0,109.6875,1394.955877,8837.4,8837.4,2.505619,1942.200,...,0.909091,362.589063,152.388329,442.4,19.457667,280.083729,34.651163,0.000000,70.000000,70.05330
2,183.809524,183.809524,76.190476,258.0,65.9375,1390.495419,7952.1,7952.1,2.173991,1454.700,...,0.909091,269.962117,205.295506,442.4,18.248364,240.746679,34.651163,66.052632,44.242424,54.30720
3,150.000000,150.000000,78.571429,270.0,72.1875,1568.458014,8782.8,8782.8,2.923710,1446.900,...,0.909091,176.334503,209.483750,439.6,17.041344,259.450160,34.651163,131.052632,48.484848,75.34995
4,150.476190,150.476190,78.571429,270.0,72.1875,1397.011439,9055.8,9055.8,3.216288,1446.900,...,0.909091,171.500000,208.792693,436.8,14.621417,281.763899,34.651163,131.052632,48.787879,78.33015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1976,150.000000,150.000000,77.619048,270.0,72.1875,1578.070768,9500.4,9500.4,2.780095,1454.700,...,0.909091,320.563897,181.612588,439.6,14.664490,283.875853,34.651163,131.052632,48.484848,76.78335
1977,150.000000,150.000000,72.857143,258.0,65.9375,1312.422203,7133.1,7133.1,3.103147,1454.700,...,0.909091,328.443836,177.388491,439.6,13.922268,280.848635,34.651163,66.052632,43.939394,64.67670
1978,150.000000,150.000000,77.142857,348.0,101.8750,1296.870050,7133.1,7133.1,2.888540,1614.600,...,0.909091,342.041261,225.105822,560.0,14.347971,277.027938,34.651163,131.052632,60.606061,71.10945
1979,150.000000,150.000000,72.857143,258.0,65.9375,1243.915182,7959.9,7959.9,3.373389,1454.700,...,0.909091,336.161096,167.835034,434.0,14.042689,281.165688,34.651163,66.052632,44.242424,65.95260


In [32]:
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

X = ts2.drop('AVG_REMOVAL_RATE', axis=1)
y = ts2['AVG_REMOVAL_RATE']

# Fitness Function for Regression
def fitness_function(selected_features):
    X_subset = X.iloc[:, selected_features]
    model = RandomForestRegressor()
    model.fit(X_subset, y)
    predictions = model.predict(X_subset)
    mse = np.mean((predictions - y) ** 2)
    return mse

# Simulated Annealing Specific Functions
def neighbor(solution):
    """Generate a neighbor by flipping a random bit"""
    neighbor = solution.copy()
    index = random.randint(0, len(solution) - 1)
    neighbor[index] = 1 - neighbor[index]  # Flip bit
    return neighbor

def acceptance_probability(old_cost, new_cost, temperature):
    """Calculate acceptance probability"""
    if new_cost < old_cost:
        return 1.0
    else:
        return np.exp((old_cost - new_cost) / temperature)

def simulated_annealing(max_iterations=100, initial_temperature=100, cooling_rate=0.99):
    current_solution = [random.randint(0, 1) for _ in range(len(X.columns))]
    current_cost = fitness_function([i for i, bit in enumerate(current_solution) if bit == 1])
    temperature = initial_temperature
    
    for iteration in range(max_iterations):
        new_solution = neighbor(current_solution)
        new_cost = fitness_function([i for i, bit in enumerate(new_solution) if bit == 1])
        
        if acceptance_probability(current_cost, new_cost, temperature) > random.random():
            current_solution, current_cost = new_solution, new_cost
            
        temperature *= cooling_rate  # Cool down
        
        if iteration % 10 == 0:  # Print every 100 iterations
            print(f"Iteration {iteration}: Cost = {current_cost}, Temp = {temperature}")
    
    return current_solution

# Run Simulated Annealing
best_solution = simulated_annealing()
selected_features_final = [i for i, bit in enumerate(best_solution) if bit == 1]

# Evaluate Results
X_final_subset = X.iloc[:, selected_features_final]


Iteration 0: Cost = 12121.89878844346, Temp = 99.0
Iteration 10: Cost = 10968.415393305233, Temp = 89.53382542587164
Iteration 20: Cost = 10941.923932505859, Temp = 80.97278682212585
Iteration 30: Cost = 10631.328720901613, Temp = 73.23033696543976
Iteration 40: Cost = 10601.022043419338, Temp = 66.22820409839836
Iteration 50: Cost = 10601.022043419338, Temp = 59.89560064661612
Iteration 60: Cost = 10601.022043419338, Temp = 54.16850759668538
Iteration 70: Cost = 10601.022043419338, Temp = 48.98902730042051
Iteration 80: Cost = 10599.347566288303, Temp = 44.30479816261727
Iteration 90: Cost = 10552.07058765455, Temp = 40.06846529515408


In [33]:
X_final_subset

Unnamed: 0,PRESSURIZED_CHAMBER_PRESSURE_Maximum,PRESSURIZED_CHAMBER_PRESSURE_75thPercentile,CENTER_AIR_BAG_PRESSURE_75thPercentile,RETAINER_RING_PRESSURE_75thPercentile,RIPPLE_AIR_BAG_PRESSURE_75thPercentile,SLURRY_FLOW_LINE_A_Median,SLURRY_FLOW_LINE_A_50thPercentile,SLURRY_FLOW_LINE_B_50thPercentile,SLURRY_FLOW_LINE_B_75thPercentile,SLURRY_FLOW_LINE_C_Mean,WAFER_ROTATION_Mean,WAFER_ROTATION_Variance,STAGE_ROTATION_75thPercentile
0,150.000000,78.571429,72.1875,1453.725,9.954545,2.222222,2.222222,0.909091,0.909091,322.161850,14.553031,281.198986,114.901316
1,150.000000,146.190476,109.6875,1942.200,17.318182,2.222222,2.222222,0.909091,0.909091,362.589063,19.457667,280.083729,0.000000
2,183.809524,76.190476,65.9375,1454.700,10.045455,2.222222,2.222222,0.909091,0.909091,269.962117,18.248364,240.746679,66.052632
3,150.000000,78.571429,72.1875,1446.900,9.954545,2.222222,2.222222,0.909091,0.909091,176.334503,17.041344,259.450160,131.052632
4,150.476190,78.571429,72.1875,1446.900,9.954545,2.222222,2.222222,0.909091,0.909091,171.500000,14.621417,281.763899,131.052632
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1976,150.000000,77.619048,72.1875,1454.700,10.000000,2.222222,2.222222,0.909091,0.909091,320.563897,14.664490,283.875853,131.052632
1977,150.000000,72.857143,65.9375,1454.700,10.000000,2.222222,2.222222,0.909091,0.909091,328.443836,13.922268,280.848635,66.052632
1978,150.000000,77.142857,101.8750,1614.600,15.227273,2.222222,2.222222,0.909091,0.909091,342.041261,14.347971,277.027938,131.052632
1979,150.000000,72.857143,65.9375,1454.700,10.045455,2.222222,2.222222,0.909091,0.909091,336.161096,14.042689,281.165688,66.052632


In [34]:
output_data = pd.read_csv("data/CMP-" + "training" + "-removalrate.csv")
output_data


Unnamed: 0,WAFER_ID,STAGE,AVG_REMOVAL_RATE
0,-4224160600,A,61.65480
1,-4224160584,B,75.86415
2,-4224160580,B,71.90700
3,-4113511818,A,65.02230
4,-4113511814,A,58.27905
...,...,...,...
1976,33494136,A,72.76305
1977,33494140,B,85.26705
1978,33494166,A,73.33245
1979,35494162,A,74.61390


In [35]:
X_final_subset["AVG_REMOVAL_RATE"] = output_data["AVG_REMOVAL_RATE"]
X_final_subset

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_final_subset["AVG_REMOVAL_RATE"] = output_data["AVG_REMOVAL_RATE"]


Unnamed: 0,PRESSURIZED_CHAMBER_PRESSURE_Maximum,PRESSURIZED_CHAMBER_PRESSURE_75thPercentile,CENTER_AIR_BAG_PRESSURE_75thPercentile,RETAINER_RING_PRESSURE_75thPercentile,RIPPLE_AIR_BAG_PRESSURE_75thPercentile,SLURRY_FLOW_LINE_A_Median,SLURRY_FLOW_LINE_A_50thPercentile,SLURRY_FLOW_LINE_B_50thPercentile,SLURRY_FLOW_LINE_B_75thPercentile,SLURRY_FLOW_LINE_C_Mean,WAFER_ROTATION_Mean,WAFER_ROTATION_Variance,STAGE_ROTATION_75thPercentile,AVG_REMOVAL_RATE
0,150.000000,78.571429,72.1875,1453.725,9.954545,2.222222,2.222222,0.909091,0.909091,322.161850,14.553031,281.198986,114.901316,61.65480
1,150.000000,146.190476,109.6875,1942.200,17.318182,2.222222,2.222222,0.909091,0.909091,362.589063,19.457667,280.083729,0.000000,75.86415
2,183.809524,76.190476,65.9375,1454.700,10.045455,2.222222,2.222222,0.909091,0.909091,269.962117,18.248364,240.746679,66.052632,71.90700
3,150.000000,78.571429,72.1875,1446.900,9.954545,2.222222,2.222222,0.909091,0.909091,176.334503,17.041344,259.450160,131.052632,65.02230
4,150.476190,78.571429,72.1875,1446.900,9.954545,2.222222,2.222222,0.909091,0.909091,171.500000,14.621417,281.763899,131.052632,58.27905
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1976,150.000000,77.619048,72.1875,1454.700,10.000000,2.222222,2.222222,0.909091,0.909091,320.563897,14.664490,283.875853,131.052632,72.76305
1977,150.000000,72.857143,65.9375,1454.700,10.000000,2.222222,2.222222,0.909091,0.909091,328.443836,13.922268,280.848635,66.052632,85.26705
1978,150.000000,77.142857,101.8750,1614.600,15.227273,2.222222,2.222222,0.909091,0.909091,342.041261,14.347971,277.027938,131.052632,73.33245
1979,150.000000,72.857143,65.9375,1454.700,10.045455,2.222222,2.222222,0.909091,0.909091,336.161096,14.042689,281.165688,66.052632,74.61390


In [36]:
X_final_subset.to_csv("AliDara.csv", index=False)

In [37]:
training_inputs = training_set[X_final_subset.columns].values
training_outputs = training_set['AVG_REMOVAL_RATE'].values

# Create test inputs and outputs
test_inputs = test_set[X_final_subset.columns].values
test_outputs = test_set['AVG_REMOVAL_RATE'].values

In [43]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error

# Initialize two MinMaxScaler: one for inputs and one for outputs
inputs_scaler = MinMaxScaler()
outputs_scaler = MinMaxScaler()

# Fit the scalers to the training data and transform both training and test data
# Scaling inputs
scaled_training_inputs = inputs_scaler.fit_transform(training_inputs)
scaled_test_inputs = inputs_scaler.transform(test_inputs)

# Scaling outputs. Reshape is used because fit_transform expects 2D array
scaled_training_outputs = outputs_scaler.fit_transform(training_outputs.reshape(-1, 1)).flatten()
scaled_test_outputs = outputs_scaler.transform(test_outputs.reshape(-1, 1)).flatten()
def calculate_metrics(model, model_name, scaled_training_inputs, scaled_training_outputs, scaled_test_inputs, scaled_test_outputs, outputs_scaler, is_cnn=False, epochs=10, batch_size=32):
    # Fit the model
    if is_cnn:
        scaled_training_inputs_cnn = scaled_training_inputs.reshape((scaled_training_inputs.shape[0], scaled_training_inputs.shape[1], 1))
        scaled_test_inputs_cnn = scaled_test_inputs.reshape((scaled_test_inputs.shape[0], scaled_test_inputs.shape[1], 1))
        
        model.fit(scaled_training_inputs_cnn, scaled_training_outputs, epochs=epochs, batch_size=batch_size, verbose=1)
        predictions = model.predict(scaled_test_inputs_cnn).flatten()
    else:
        model.fit(scaled_training_inputs, scaled_training_outputs)
        predictions = model.predict(scaled_test_inputs)

    # Inverse transform the predictions and the actual values to get back to the original scale
    predictions_inv = outputs_scaler.inverse_transform(predictions.reshape(-1, 1)).flatten()
    test_outputs_inv = outputs_scaler.inverse_transform(scaled_test_outputs.reshape(-1, 1)).flatten()

    # Calculate errors for the model
    errors = predictions_inv - test_outputs_inv
    relative_errors = errors / np.maximum(np.abs(test_outputs_inv), 1e-8)

    # Calculate additional metrics for the model
    metrics = {
        'Mean of Error': np.mean(errors),
        'Max of Error': np.max(errors),
        'MAE': mean_absolute_error(test_outputs_inv, predictions_inv),
        'Mean Absolute Percentage Error': mean_absolute_percentage_error(test_outputs_inv, predictions_inv),
        'Max Absolute Percentage Error': np.max(np.abs(errors / test_outputs_inv)),
        'MSE': mean_squared_error(test_outputs_inv, predictions_inv),
    }

    return {'Model': model_name, **metrics}

# Example usage:
# Ridge Regression
model_ridge = Ridge()
metrics_ridge = calculate_metrics(model_ridge, 'Ridge Regression', scaled_training_inputs, scaled_training_outputs, scaled_test_inputs, scaled_test_outputs, outputs_scaler)


# SVR
model_svr = SVR()
metrics_svr = calculate_metrics(model_svr, 'SVR', scaled_training_inputs, scaled_training_outputs, scaled_test_inputs, scaled_test_outputs, outputs_scaler)
# Lasso
model_lasso = Lasso()
metrics_lasso = calculate_metrics(model_lasso, 'Lasso', scaled_training_inputs, scaled_training_outputs, scaled_test_inputs, scaled_test_outputs, outputs_scaler)
# Decision Tree
model_dt = DecisionTreeRegressor()
metrics_dt = calculate_metrics(model_dt, 'Decision Tree', scaled_training_inputs, scaled_training_outputs, scaled_test_inputs, scaled_test_outputs, outputs_scaler)
# Gradient Boosting
model_gbr = GradientBoostingRegressor()
metrics_gbr = calculate_metrics(model_gbr, 'Gradient Boosting', scaled_training_inputs, scaled_training_outputs, scaled_test_inputs, scaled_test_outputs, outputs_scaler)
# Elastic Net
model_en = ElasticNet()
metrics_en = calculate_metrics(model_en, 'Elastic Net', scaled_training_inputs, scaled_training_outputs, scaled_test_inputs, scaled_test_outputs, outputs_scaler)
# CNN
model_cnn = Sequential()
model_cnn.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(scaled_training_inputs.shape[1], 1)))
model_cnn.add(Flatten())
model_cnn.add(Dense(64, activation='relu'))
model_cnn.add(Dense(1))

# Compile the model
model_cnn.compile(optimizer='adam', loss='mean_squared_error')

metrics_cnn = calculate_metrics(model_cnn, 'CNN', scaled_training_inputs, scaled_training_outputs, scaled_test_inputs, scaled_test_outputs, outputs_scaler, is_cnn=True, epochs=10, batch_size=32)

# Add Ridge Regression metrics to the results DataFrame
results_df = pd.DataFrame([metrics_ridge, metrics_svr, metrics_lasso, metrics_dt, metrics_gbr, metrics_en])  # Add metrics_ridge here
results_df.set_index('Model', inplace=True)



Epoch 1/10


  super().__init__(


[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 648us/step - loss: 0.0045 
Epoch 2/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 593us/step - loss: 0.0012  
Epoch 3/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 580us/step - loss: 0.0016  
Epoch 4/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 578us/step - loss: 6.1982e-04
Epoch 5/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 717us/step - loss: 2.5548e-04
Epoch 6/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 768us/step - loss: 2.2220e-04
Epoch 7/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 671us/step - loss: 6.7066e-05
Epoch 8/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 625us/step - loss: 2.0466e-05
Epoch 9/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 629us/step - loss: 4.5029e-05
Epoch 10/10
[1m62/62[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[

In [44]:
# Display the results DataFrame
results_df

Unnamed: 0_level_0,Mean of Error,Max of Error,MAE,Mean Absolute Percentage Error,Max Absolute Percentage Error,MSE
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Ridge Regression,1.472728,31.779296,4.35101,0.044963,0.282306,44.837269
SVR,334.04347,536.426021,334.04347,3.967763,7.382702,113262.846397
Lasso,8.695917,44.129145,27.14883,0.307886,0.809672,949.473107
Decision Tree,0.002798,1.09395,0.040416,0.000448,0.011075,0.010272
Gradient Boosting,0.006112,1.675797,0.183654,0.002211,0.016966,0.066156
Elastic Net,8.695917,44.129145,27.14883,0.307886,0.809672,949.473107
