# Feature Engineering
## Objectives
* Encode categorical variables (fuel, seller_type, transmission).
* Scale numerical features for machine learning models.

## Outputs
* Processed dataset saved as 'processed_car_data.csv'.

## Additional Comments
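* Use one-hot encoding for categorical variables and StandardScaler for numerical features. The pipeline in this notebook passes numerical columns through unscaled, which is acceptable for tree-based models such as RandomForestRegressor because they are insensitive to monotonic feature scaling.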

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

In [43]:


# Load dataset
df = pd.read_csv('../datasets/car_dataset_cleaned.csv')

# Print a header only; call df.info() afterwards to list column dtypes and non-null counts
print('Dataset Info:')



Dataset Info:


In [22]:
df.columns

Index(['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type',
       'transmission', 'owner', 'mileage(km/ltr/kg)', 'engine', 'max_power',
       'seats'],
      dtype='object')

In [23]:
# Define features and target
X = df.drop(columns=['selling_price'])
y = df['selling_price']

In [24]:
X

,name,year,km_driven,fuel,seller_type,transmission,owner,mileage(km/ltr/kg),engine,max_power,seats
0,Maruti Swift Dzire VDI,2014,145500,Diesel,Individual,Manual,First Owner,23.40,1248.0,74,5.0
1,Skoda Rapid 1.5 TDI Ambition,2014,120000,Diesel,Individual,Manual,Second Owner,21.14,1498.0,103.52,5.0
2,Honda City 2017-2020 EXi,2006,140000,Petrol,Individual,Manual,Third Owner,17.70,1497.0,78,5.0
3,Hyundai i20 Sportz Diesel,2010,127000,Diesel,Individual,Manual,First Owner,23.00,1396.0,90,5.0
4,Maruti Swift VXI BSIII,2007,120000,Petrol,Individual,Manual,First Owner,16.10,1298.0,88.2,5.0
...,...,...,...,...,...,...,...,...,...,...,...
7902,Hyundai i20 Magna,2013,110000,Petrol,Individual,Manual,First Owner,18.50,1197.0,82.85,5.0
7903,Hyundai Verna CRDi SX,2007,119000,Diesel,Individual,Manual,Fourth & Above Owner,16.80,1493.0,110,5.0
7904,Maruti Swift Dzire ZDi,2009,120000,Diesel,Individual,Manual,First Owner,19.30,1248.0,73.9,5.0
7905,Tata Indigo CR4,2013,25000,Diesel,Individual,Manual,First Owner,23.57,1396.0,70,5.0


In [25]:
y

0       450000
1       370000
2       158000
3       225000
4       130000
         ...  
7902    320000
7903    135000
7904    382000
7905    290000
7906    290000
Name: selling_price, Length: 7907, dtype: int64

In [26]:
# Split the feature columns into categorical (object dtype) and numerical lists
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(exclude=['object']).columns.tolist()
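Splitting by dtype assumes every numeric column was parsed as a number. The sketch below shows, on made-up data, how a numeric column stored as strings is left out of the numeric selection until it is converted with `pd.to_numeric`. Whether any column in the car dataset (for example `max_power`) is actually stored this way is an assumption that should be checked with `df.dtypes`; the name `toy_cars` is hypothetical.

```python
import pandas as pd

# Toy frame where max_power holds numbers stored as strings (an assumed scenario).
toy_cars = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "Diesel"],
    "max_power": ["74", "103.52", "78"],
    "seats": [5.0, 5.0, 5.0],
})

# Before conversion, max_power is not recognised as numeric.
numeric_before = toy_cars.select_dtypes(include=["number"]).columns.tolist()

# Convert to numbers; unparseable entries become NaN instead of raising.
toy_cars["max_power"] = pd.to_numeric(toy_cars["max_power"], errors="coerce")

numeric_after = toy_cars.select_dtypes(include=["number"]).columns.tolist()

print("numeric before:", numeric_before)
print("numeric after:", numeric_after)
```

If a column like this were left as text, the one-hot encoder would treat each distinct value as its own category, which inflates the feature matrix.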


In [27]:
# One-hot encode the categorical columns; handle_unknown='ignore' encodes
# categories not seen during fitting as all zeros instead of raising an error
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
], remainder='passthrough')  # Pass through numerical columns


In [28]:
# Chain preprocessing and the regressor into a single pipeline
model = Pipeline([
    ('preprocess', preprocessor),
    ('regressor', RandomForestRegressor())
])


In [29]:
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [30]:
# Fit the full pipeline (encoding + random forest) on the training data
model.fit(X_train, y_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [None]:
X_test.iloc[0:1]

,name,year,km_driven,fuel,seller_type,transmission,owner,mileage(km/ltr/kg),engine,max_power,seats
3641,Maruti Swift VDI,2014,68000,Diesel,Dealer,Manual,First Owner,22.9,1248.0,74,5.0


In [41]:
# Example prediction
predictions = model.predict(X_test.iloc[0:1])
print("Predictions:", predictions)

Predictions: [510899.91]
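A single prediction says little about overall model quality. The sketch below shows how held-out performance could be summarised with mean absolute error and R². It uses synthetic data and hypothetical names (`X_syn`, `y_syn`, `reg`) so that it runs on its own; the resulting numbers say nothing about the car price model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: the target depends on the first two features plus noise.
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 1, size=(200, 3))
y_syn = 3.0 * X_syn[:, 0] + X_syn[:, 1] + rng.normal(0, 0.05, size=200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Fixing random_state makes the forest, and therefore the scores, reproducible.
reg = RandomForestRegressor(n_estimators=50, random_state=42)
reg.fit(Xs_train, ys_train)

# Score the model on rows it has not seen during fitting.
pred = reg.predict(Xs_test)
mae = mean_absolute_error(ys_test, pred)
r2 = r2_score(ys_test, pred)

print(f"MAE: {mae:.3f}")
print(f"R^2: {r2:.3f}")
```

In the notebook, the same two metric calls could be applied to `model.predict(X_test)` and `y_test`.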
