# **03 Feature Build**

In [None]:
import pandas as pd
import numpy as np
import os

# Path to the merged vehicle dataset
path = '/lakehouse/default/Files/data/processed/merged_vehicle_data.csv'

# Load the dataset into a pandas DataFrame
df = pd.read_csv(path)


The next code cell cleans the dataframe `df` and engineers new features from it. The steps, and the rationale for each engineered feature, are as follows:

### Drop unnecessary columns
The columns `'MAKE_df1', 'MODEL_df1', 'YEAR_df1', 'Combined_Key_df1', 'DF2_Index', 'MAKE_df2', 'MODEL_df2', 'MYR', 'FILE_YEAR', 'YEAR_df2', 'Combined_Key_df2'` are removed from the dataframe. Judging by their `_df1`/`_df2` suffixes, they look like join keys and duplicated make, model, and year identifiers left over from the merge, so they are not needed for the analysis.
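If a given export might lack some of the listed columns, the drop can be made tolerant of missing labels. The following is a minimal sketch under that assumption; the toy dataframe and its values are invented for illustration, and `errors="ignore"` is the `DataFrame.drop` option that skips labels not present.

```python
import pandas as pd

# Toy frame standing in for the merged data (values are invented).
toy = pd.DataFrame({
    "MAKE_df1": ["A"],
    "MAKE_df2": ["A"],
    "OL": [180.0],
})

# 'MYR' is deliberately absent from the toy frame.
cols_to_drop = ["MAKE_df1", "MAKE_df2", "MYR"]

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed = toy.drop(columns=cols_to_drop, errors="ignore")
print(list(trimmed.columns))
```

The trade-off is that a misspelled column name is silently ignored too, so the strict default may be preferable when the schema is known to be stable.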

### Interpret 'WDIST' to extract Front and Rear weight distribution
The `'WDIST'` column holds the vehicle's front and rear weight distribution as a single string separated by `/`. It is split into two new columns, `'WDF'` (front) and `'WDR'` (rear), which are cast to `float` for numerical operations.
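The direct `astype(float)` cast raises an error if any entry is not numeric. A hedged sketch of a more forgiving parse is shown here, assuming that some `WDIST` entries may be missing or malformed; the sample values are invented, and unparseable entries become `NaN` rather than stopping the run.

```python
import pandas as pd

# Invented sample values, including a missing and a malformed entry.
toy = pd.DataFrame({"WDIST": ["60/40", "55/45", None, "n/a"]})

# Split on the first '/' only, so each row yields at most two parts.
parts = toy["WDIST"].str.split("/", n=1, expand=True)

# errors="coerce" turns anything non-numeric into NaN.
toy["WDF"] = pd.to_numeric(parts[0], errors="coerce")
toy["WDR"] = pd.to_numeric(parts[1], errors="coerce")

print(toy)
```

Rows that end up as `NaN` can then be inspected or imputed explicitly instead of being discovered through an exception.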

### Building new features

- **Vehicle Volume (VV):** Calculated as `OL x OW x OH`. This is the volume of the box that encloses the vehicle, not of the body itself. Larger vehicles tend to be heavier and to face more air resistance, so it may correlate with energy consumption.

- **Surface Area (SA):** Calculated as `2 * (OL x OW + OL x OH + OW x OH)`. This is the surface area of the box that encloses the vehicle, a rough proxy for its exterior surface, which could influence energy consumption through air resistance.

- **Aspect Ratio (AR):** The ratio of `OW` to `OH`, providing insights into the vehicle's aerodynamic profile and potentially its energy efficiency.

- **Wheelbase to Length Ratio (WBLR):** Represents the ratio of the wheelbase `WB` to the overall length `OL`, indicating vehicle stability and aerodynamics.

- **Frontal Area (FA):** Calculated as `OW x OH`, directly impacting the vehicle's air resistance as it moves forward.

- **Track Width Difference (TWD):** `|TWF - TWR|`, the absolute difference between the front and rear track widths, which relates to vehicle stability and might affect energy efficiency.

- **Overhang Ratio Front (ORF)** and **Rear (ORR):** Calculated as `F1/WB` and `G1/WB` respectively, showing the proportion of the vehicle's overhang to its wheelbase, affecting aerodynamics and stability.

- **Glass Height to Body Height Ratio (GHBHR):** `C1/OH`, indicating the proportion of the vehicle height made up by the side glass, which could influence aerodynamics.

- **Underbody Clearance (UBC):** Taken directly from `D1`. Represents the clearance under the vehicle, affecting aerodynamics at high speeds.

- **Cabin Width to Vehicle Width Ratio (CWVWR):** `E1/OW`, showing how much of the vehicle's width is occupied by the cabin.

- **Curb Weight to Volume Ratio (CWVR):** `CW/(OL x OW x OH)`, indicating how the vehicle's mass is distributed over its volume, which could impact energy efficiency.

- **Total Overhang (TO):** Sum of front and rear overhangs (`F1 + G1`), potentially affecting vehicle dynamics and energy consumption.

- **Average Track Width (ATW):** `(TWF + TWR) / 2`, the average of the front and rear track widths, providing an overview of vehicle stability.

- **Body to Wheelbase Ratio (BWR):** `(A1 + B1)/WB`, indicating the proportion of the vehicle's body length to its wheelbase, affecting dynamics.

- **Curb Weight to Length Ratio (CWLR):** `CW/OL`, showing how weight is distributed relative to the vehicle's length.

- **Frontal Aspect Ratio (FAR):** `OW/C1`, providing insights into the vehicle's aerodynamics by comparing its width to the height of its glass area.

- **Weight Distribution Front to Rear Ratio (WDFR_RATIO):** `WDF/WDR`, indicating the balance of weight distribution between the front and rear of the vehicle.

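- **Combined Weight Distribution Impact (CW_WD_IMPACT):** Calculated as `(WDF x CW) + (WDR x CW)`, combining the weight distribution with the curb weight. Note that this equals `CW x (WDF + WDR)`, so if `WDF` and `WDR` always sum to the same total (for example 100), the feature is only a rescaled copy of `CW`.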
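Several of the features above are ratios, so a zero denominator in the data would produce `inf` values. The sketch below works through a few of the formulas on toy rows with invented values. The `safe_ratio` helper is hypothetical (it is not part of the notebook) and replaces zero denominators with `NaN`. The last lines check what `CW_WD_IMPACT` reduces to when `WDF + WDR` is the same in every row.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values; column names follow the notebook.
toy = pd.DataFrame({
    "OL": [180.0, 200.0],
    "OW": [70.0, 80.0],
    "OH": [60.0, 0.0],      # a zero height to exercise the guarded division
    "CW": [3000.0, 4000.0],
    "WDF": [60.0, 55.0],
    "WDR": [40.0, 45.0],
})


def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Hypothetical helper: divide, giving NaN where the denominator is zero."""
    return num / den.replace(0, np.nan)


toy["VV"] = toy["OL"] * toy["OW"] * toy["OH"]
toy["AR"] = safe_ratio(toy["OW"], toy["OH"])
toy["CWVR"] = safe_ratio(toy["CW"], toy["VV"])
toy["CW_WD_IMPACT"] = toy["WDF"] * toy["CW"] + toy["WDR"] * toy["CW"]

# Ratio of the combined feature to CW; identical across rows when
# WDF + WDR is constant, i.e. the feature is a rescaled copy of CW.
scale = toy["CW_WD_IMPACT"] / toy["CW"]
print(scale.tolist())
print(toy[["VV", "AR", "CWVR"]])
```

If the scale factor is constant on the real data as well, dropping `CW_WD_IMPACT` or replacing it with a different combination would avoid feeding a duplicate of `CW` to the model.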


In [None]:
# Drop unnecessary columns
df = df.drop(['MAKE_df1', 'MODEL_df1', 'YEAR_df1', 'Combined_Key_df1', 'DF2_Index', 
              'MAKE_df2', 'MODEL_df2', 'MYR', 'FILE_YEAR', 'YEAR_df2', 'Combined_Key_df2'], axis=1)

# Interpret 'WDIST' to extract Front and Rear weight distribution
df[['WDF', 'WDR']] = df['WDIST'].str.split('/', expand=True).astype(float)

## Building new features 
# Vehicle Volume
df['VV'] = df['OL'] * df['OW'] * df['OH']
# Surface Area
df['SA'] = 2 * (df['OL'] * df['OW'] + df['OL'] * df['OH'] + df['OW'] * df['OH'])
# Aspect Ratio: 
df['AR'] = df['OW'] / df['OH']
# Wheelbase to Length Ratio: 
df['WBLR'] = df['WB'] / df['OL']
# Frontal Area: 
df['FA'] = df['OW'] * df['OH']
# Track Width Difference: 
df['TWD'] = (df['TWF'] - df['TWR']).abs()
# Overhang Ratio Front: 
df['ORF'] = df['F1'] / df['WB']
# Overhang Ratio Rear: 
df['ORR'] = df['G1'] / df['WB']
# Glass Height to Body Height Ratio: 
df['GHBHR'] = df['C1'] / df['OH']
# Underbody Clearance: 
df['UBC'] = df['D1']
# Cabin Width to Vehicle Width Ratio: 
df['CWVWR'] = df['E1'] / df['OW']
# Curb Weight to Volume Ratio: 
df['CWVR'] = df['CW'] / df['VV']
# Total Overhang: 
df['TO'] = df['F1'] + df['G1']
# Average Track Width: 
df['ATW'] = (df['TWF'] + df['TWR']) / 2
# Body to Wheelbase Ratio: 
df['BWR'] = (df['A1'] + df['B1']) / df['WB']
# Curb Weight to Length Ratio: 
df['CWLR'] = df['CW'] / df['OL']
# Frontal Aspect Ratio: 
df['FAR'] = df['OW'] / df['C1']
# Weight Distribution Front to Rear Ratio: 
df['WDFR_RATIO'] = df['WDF'] / df['WDR']
# Combined Weight Distribution Impact (algebraically equal to CW * (WDF + WDR)):
df['CW_WD_IMPACT'] = (df['WDF'] * df['CW']) + (df['WDR'] * df['CW'])


## Save "Model-Ready" data

In [None]:
df.to_csv('/lakehouse/default/Files/data/processed/model_ready_data.csv', index=False)
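Before relying on the saved file, it can help to normalise infinite values and confirm that the CSV round-trips cleanly. The following is a sketch only: the toy dataframe is invented, and a temporary directory stands in for the lakehouse path so the example runs anywhere.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Invented stand-in for the engineered dataframe; AR holds an inf,
# as a ratio with a zero denominator would.
toy = pd.DataFrame({"VV": [756000.0, 0.0], "AR": [1.2, np.inf]})

# Convert +/-inf to NaN so they are written as empty cells.
clean = toy.replace([np.inf, -np.inf], np.nan)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "model_ready_data.csv")  # temporary stand-in path
    clean.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare shape and count missing values after the round trip.
print(reloaded.shape == clean.shape, int(reloaded["AR"].isna().sum()))
```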

