## ⏬ **Import Modules & Set Up Environment**

*Importing necessary libraries*

In [1]:
import pandas as pd  # For working with dataframes and performing data manipulation


*Setting up interactive environment*

In [2]:
# Importing additional functionality from IPython
from IPython.core.interactiveshell import InteractiveShell

# Setting an IPython display option to show all outputs in a single cell
# By default, Jupyter only shows the result of the last line in a cell.
# This line changes that behavior to display the output of all statements in the cell.
InteractiveShell.ast_node_interactivity = "all"


## 📂 **Load Dataset**

In [3]:
# Read data
df = pd.read_csv("../data/cleaned_car_data.csv")


In [4]:
# Inspect the shape and preview the data
df.shape
df


(2494, 20)

Unnamed: 0,Brand,Type,Reg_year,Reg_month,Days_since_registration,Coe_left,Depreciation,Mileage,Road_Tax,Dereg_Value,COE,Engine_Capacity,Curb_Weight,Manufactured,Transmission,OMV,ARF,Power,Number_of_Owners,Price
0,Honda,Suv,2015,10,3395,1587,10310,50000,682,31237,56001,1496,1190,2015,Auto,19775,9775,96,2,49800
1,Suzuki,Hatchback,2007,12,6242,201,8210,203000,1030,6656,21349,1586,1060,2007,Manual,12154,13370,92,2,12800
2,Porsche,Sports Car,2017,7,2751,2202,34200,21000,1200,106829,50110,1988,1365,2017,Auto,71979,101563,220,1,259988
3,Hyundai,Mid-Sized Sedan,2014,11,3729,1252,11010,35000,738,31339,64900,1591,1292,2014,Auto,13856,13856,97,1,44800
4,Kia,Mid-Sized Sedan,2019,7,2036,2947,9450,21200,738,38416,30009,1591,1287,2018,Auto,18894,18894,93,1,85800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2489,Bmw,Luxury Sedan,2018,2,2533,2445,17990,55000,1210,68148,39000,1998,1530,2017,Auto,45675,55945,135,2,148800
2490,Subaru,Suv,2010,4,5415,1300,7500,172892,1447,14045,19657,1994,1485,2010,Auto,19090,19090,110,3,26800
2491,Mercedes-Benz,Luxury Sedan,2013,12,4058,923,15240,127000,740,35772,73989,1595,1485,2013,Auto,29525,28335,115,4,52828
2492,Mazda,Mid-Sized Sedan,2017,12,2610,2370,9340,40200,682,35419,42801,1496,1310,2017,Auto,15108,10108,88,3,65800


## 🧺 **Feature Selection**

### 🧺 **Remove strongly correlated features**
- Based on the heatmap and pairplot
    - From each pair of features with an absolute correlation of 0.9 or above, remove one feature

- `Engine_Capacity` and `Road_Tax` have a strong positive linear relationship
    - Remove `Road_Tax`, because buyers are more likely to know and state a preference for engine capacity than for road tax
- `Days_since_registration` and `Reg_year` have a strong negative linear relationship
    - Remove `Days_since_registration`, because buyers ask for the registration year more often than for the time elapsed since registration
- `OMV` and `ARF` have a strong positive linear relationship
    - Remove `ARF`, because buyers are unlikely to have this figure on hand
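
The 0.9 rule above can also be checked programmatically rather than read off a heatmap. The following is a minimal sketch, not the method used in this notebook: `high_correlation_pairs` is a hypothetical helper, and the toy DataFrame is made-up illustration data, not the car dataset. It lists each pair of numeric columns whose absolute Pearson correlation meets the threshold.

```python
from itertools import combinations

import pandas as pd


def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Return numeric column pairs whose absolute Pearson correlation is >= threshold.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    corr = frame.select_dtypes(include="number").corr()
    rows = []
    # combinations() yields each unordered pair once and skips self-pairs.
    for a, b in combinations(corr.columns, 2):
        r = corr.loc[a, b]
        if abs(r) >= threshold:
            rows.append((a, b, r))
    return pd.DataFrame(rows, columns=["feature_a", "feature_b", "correlation"])


# Toy data for illustration only (assumed values, not the car dataset).
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x_scaled": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "x_reversed": [5, 4, 3, 2, 1],  # perfectly anti-correlated with x
    "noise": [3, 1, 4, 1, 5],       # only weakly correlated with x
})
flagged = high_correlation_pairs(toy)
print(flagged)
```

Calling `high_correlation_pairs(df)` on the loaded dataset before the drop should surface the same candidate pairs as the heatmap. Deciding which feature of each pair to keep remains a judgment call, as in the bullets above.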

In [5]:
# Drop the columns
df = df.drop(columns=["Road_Tax", "Days_since_registration", "ARF"])

# Display the updated DataFrame
df.shape
df.head(1)


(2494, 17)

Unnamed: 0,Brand,Type,Reg_year,Reg_month,Coe_left,Depreciation,Mileage,Dereg_Value,COE,Engine_Capacity,Curb_Weight,Manufactured,Transmission,OMV,Power,Number_of_Owners,Price
0,Honda,Suv,2015,10,1587,10310,50000,31237,56001,1496,1190,2015,Auto,19775,96,2,49800


### 🧺 **Remove features based on domain knowledge**
- Remove `Reg_month`
    - The month of registration is unlikely to provide meaningful insight unless there is a seasonal effect.
    - A seasonal effect is unlikely here: cars are a big-ticket item, so the time of year is not a major consideration for potential buyers.
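
The seasonality assumption can be sanity-checked before dropping the column. This is a rough sketch under stated assumptions: the toy values below are invented for illustration, and comparing median prices per month is only one simple way to look for a seasonal pattern, not a formal test.

```python
import pandas as pd

# Toy data for illustration only (assumed values, not the car dataset).
toy = pd.DataFrame({
    "Reg_month": [1, 1, 6, 6, 12, 12],
    "Price": [50000, 52000, 51000, 49000, 50500, 51500],
})

# Number of listings and median price for each registration month.
by_month = toy.groupby("Reg_month")["Price"].agg(["count", "median"])

# Gap between the highest and lowest monthly median price.
spread = by_month["median"].max() - by_month["median"].min()

print(by_month)
print("Spread of monthly medians:", spread)
```

Running the same `groupby` on `df` would show whether monthly medians differ by much relative to typical prices. A small spread would be consistent with the domain reasoning above.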

In [6]:
# Drop the column
df = df.drop(columns=["Reg_month"])

# Display the updated DataFrame
df.shape
df.head(1)

(2494, 16)

Unnamed: 0,Brand,Type,Reg_year,Coe_left,Depreciation,Mileage,Dereg_Value,COE,Engine_Capacity,Curb_Weight,Manufactured,Transmission,OMV,Power,Number_of_Owners,Price
0,Honda,Suv,2015,1587,10310,50000,31237,56001,1496,1190,2015,Auto,19775,96,2,49800


## 📬 **Save Final Dataset**

In [7]:
df.to_csv('../data/final_car_data.csv', index=False)
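
The `index=False` argument matters when the file is read back. The sketch below uses a made-up two-row DataFrame and a temporary directory (both assumptions for illustration). It shows that writing the index adds an extra unnamed column, which `pd.read_csv` then labels `Unnamed: 0`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data for illustration only (assumed values, not the car dataset).
toy = pd.DataFrame({"Brand": ["Honda", "Kia"], "Price": [49800, 85800]})

with tempfile.TemporaryDirectory() as tmp:
    with_index_path = Path(tmp) / "with_index.csv"
    without_index_path = Path(tmp) / "without_index.csv"

    toy.to_csv(with_index_path)                  # default: the index is written as the first column
    toy.to_csv(without_index_path, index=False)  # the index is omitted

    with_index = pd.read_csv(with_index_path)
    without_index = pd.read_csv(without_index_path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Saving without the index keeps the round trip clean, so later notebooks that load the final dataset do not need to drop a stray column.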


# ----------------------------------- END -----------------------------------