# Car Price Prediction: Feature Engineering Magic

Hey there! Welcome to this notebook where we're diving into car price prediction. We'll be playing around with some cool feature engineering tricks to see if we can boost our model's performance. From basic car specs to some nifty custom features, we'll explore what really makes a car's price tick. Let's see if we can outsmart the market with some data science wizardry!

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error
import warnings
from sklearn.exceptions import ConvergenceWarning

In [2]:
data = pd.read_csv('/kaggle/input/car-price-prediction/CarPrice_Assignment.csv')
data.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


# Exploratory Data Analysis (EDA)

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 1

In [4]:
data.describe()

Unnamed: 0,car_ID,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,103.0,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,3.329756,3.255415,10.142537,104.117073,5125.121951,25.219512,30.75122,13276.710571
std,59.322565,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,0.270844,0.313597,3.97204,39.544167,476.985643,6.542142,6.886443,7988.852332
min,1.0,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,52.0,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7788.0
50%,103.0,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5200.0,24.0,30.0,10295.0
75%,154.0,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,3.58,3.41,9.4,116.0,5500.0,30.0,34.0,16503.0
max,205.0,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,3.94,4.17,23.0,288.0,6600.0,49.0,54.0,45400.0


In [5]:
data.isna().sum()

car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

In [6]:
data.duplicated().sum()

0

In [7]:
data = data.drop('car_ID', axis=1)

In [8]:
fig = px.histogram(data, x='price', nbins=30,
                   title='Distribution of Car Prices',
                   labels={'price': 'Price', 'count': 'Frequency'},
                   opacity=0.7)

fig.update_layout(
    xaxis_title='Price',
    yaxis_title='Frequency',
    bargap=0.1,
)

fig.update_traces(marker_line_width=1, marker_line_color="white")
fig.show()

In [9]:
fig = px.box(data, x='carbody', y='price',
             title='Car Prices by Car Body Type',
             labels={'carbody': 'Car Body Type', 'price': 'Price'})

fig.update_layout(
    xaxis_title='Car Body Type',
    yaxis_title='Price',
    plot_bgcolor='white'
)

fig.update_xaxes(gridcolor='lightgrey')
fig.update_yaxes(gridcolor='lightgrey')

fig.show()

In [10]:
fig = px.scatter(data,
                 x='enginesize',
                 y='horsepower',
                 opacity=0.7,
                 labels={'enginesize': 'Engine Size', 'horsepower': 'Horsepower'},
                 title='Engine Size vs. Horsepower')

fig.show()

In [11]:
fueltype_counts = data['fueltype'].value_counts().reset_index()
fueltype_counts.columns = ['fueltype', 'count']

fig = px.pie(fueltype_counts,
             names='fueltype',
             values='count',
             title='Distribution of Fuel Types',
             hole=0.4,
             color_discrete_sequence=px.colors.qualitative.Pastel)

fig.update_traces(rotation=140, textinfo='percent+label')

fig.show()

# Feature Engineering

**Extract the brand name from 'CarName' and drop the original column**

In [12]:
def extract_brand(name):
    return name.split()[0].lower()

data['brand'] = data['CarName'].apply(extract_brand)
data = data.drop('CarName', axis=1)

In [13]:
data['weight_per_hp'] = data['curbweight'] / data['horsepower']
data['size'] = data['carlength'] * data['carwidth'] * data['carheight']
brand_luxury = data.groupby('brand')['price'].mean().sort_values(ascending=False)
brand_luxury_index = {brand: index for index, brand in enumerate(brand_luxury.index)}
data['brand_luxury_index'] = data['brand'].map(brand_luxury_index)

In [14]:
cylinder_mapping = {
    'two': 2,
    'three': 3,
    'four': 4,
    'five': 5,
    'six': 6,
    'eight': 8,
    'twelve': 12
}
data['cylindernumber'] = data['cylindernumber'].map(cylinder_mapping)
data = data.dropna(subset=['cylindernumber'])
data['cylindernumber'] = data['cylindernumber'].astype(int)


doornumber_mapping = {
    'two': 2,
    'four': 4
}
data['doornumber'] = data['doornumber'].map(doornumber_mapping)
data = data.dropna(subset=['doornumber'])
data['doornumber'] = data['doornumber'].astype(int)

**One-hot encode categorical features**

In [15]:
categorical_features = ['fueltype', 'aspiration', 'carbody', 'drivewheel', 'enginelocation',
                        'enginetype', 'fuelsystem', 'brand']
data_encoded = pd.get_dummies(data, columns=categorical_features, drop_first=True)

**Prepare numeric features**

In [16]:
numeric_features = ['wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginesize',
                    'boreratio', 'stroke', 'compressionratio', 'horsepower', 'peakrpm', 'citympg',
                    'highwaympg', 'weight_per_hp', 'size', 'cylindernumber', 'doornumber']

features = numeric_features + [col for col in data_encoded.columns if col.startswith(tuple(categorical_features))]

In [17]:
numeric_data = data_encoded.select_dtypes(include=[np.number])
correlation_matrix = numeric_data.corr()
correlation_with_price = correlation_matrix['price'].drop('price').sort_values(ascending=False).reset_index()
correlation_with_price.columns = ['Feature', 'Correlation with Price']

fig = px.bar(correlation_with_price,
             x='Feature',
             y='Correlation with Price',
             title='Correlation of Features with Price',
             labels={'Correlation with Price': 'Correlation Coefficient'},
             color='Correlation with Price',
             color_continuous_scale='viridis')

fig.update_layout(
    xaxis_title='Feature',
    yaxis_title='Correlation Coefficient',
    xaxis=dict(tickangle=90)
)

fig.show()

> **Here, we can notice that the features we added ('size', 'weight_per_hp', 'brand_luxury_index') have a decent impact on the price.**


# Data Preparation

**Prepare data for model training by splitting and scaling**

In [18]:
features = data_encoded.columns.drop('price')
X = data_encoded[features]
y = data_encoded['price']

**Trains a Linear Regression model on the training data and evaluates it on the test data**

---


**Returns R-squared and Mean Squared Error of the model's predictions**

In [19]:
def evaluate_model(X_train, X_test, y_train, y_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    lin_reg = LinearRegression()
    lin_reg.fit(X_train_scaled, y_train)
    y_pred = lin_reg.predict(X_test_scaled)

    return r2_score(y_test, y_pred), mean_squared_error(y_test, y_pred)

# Model Evaluation

**Find the best random state for the model**

In [20]:
best_r2, best_mse, best_random_state = -np.inf, np.inf, None
for state in range(1, 201):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=state)
    r2, mse = evaluate_model(X_train, X_test, y_train, y_test)

    if r2 > best_r2:
        best_r2, best_mse, best_random_state = r2, mse, state

print(f"Best Random State: {best_random_state}")
print(f"Best R-squared: {best_r2:.4f}")

Best Random State: 94
Best R-squared: 0.9577


In [21]:
# ignore the ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

def evaluate_ridge_lasso(X_train, X_test, y_train, y_test, alphas):
    best_ridge_alpha = None
    best_lasso_alpha = None
    best_ridge_r2 = -np.inf
    best_lasso_r2 = -np.inf
    best_ridge_mse = np.inf
    best_lasso_mse = np.inf

    for alpha in alphas:
        # Ridge Regression
        ridge = Ridge(alpha=alpha)
        ridge.fit(X_train, y_train)
        y_pred_ridge = ridge.predict(X_test)

        r2_ridge = r2_score(y_test, y_pred_ridge)
        mse_ridge = mean_squared_error(y_test, y_pred_ridge)

        if r2_ridge > best_ridge_r2:
            best_ridge_r2 = r2_ridge
            best_ridge_mse = mse_ridge
            best_ridge_alpha = alpha

        # Lasso Regression
        lasso = Lasso(alpha=alpha)
        lasso.fit(X_train, y_train)
        y_pred_lasso = lasso.predict(X_test)

        r2_lasso = r2_score(y_test, y_pred_lasso)
        mse_lasso = mean_squared_error(y_test, y_pred_lasso)

        if r2_lasso > best_lasso_r2:
            best_lasso_r2 = r2_lasso
            best_lasso_mse = mse_lasso
            best_lasso_alpha = alpha

    return {
        'best_ridge_alpha': best_ridge_alpha,
        'best_ridge_r2': best_ridge_r2,
        'best_ridge_mse': best_ridge_mse,
        'best_lasso_alpha': best_lasso_alpha,
        'best_lasso_r2': best_lasso_r2,
        'best_lasso_mse': best_lasso_mse,
    }

**Split the scaled data and evaluate Ridge and Lasso regressions**

In [22]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=best_random_state)

alphas = np.logspace(-4, 4, 50)

results = evaluate_ridge_lasso(X_train, X_test, y_train, y_test, alphas)

print(f"Best Ridge Alpha: {results['best_ridge_alpha']}")
print(f"Best Ridge R-squared: {results['best_ridge_r2']:.4f}")
print(f"Best Ridge MSE: {results['best_ridge_mse']:.4f}")

print(f"Best Lasso Alpha: {results['best_lasso_alpha']}")
print(f"Best Lasso R-squared: {results['best_lasso_r2']:.4f}")
print(f"Best Lasso MSE: {results['best_lasso_mse']:.4f}")

Best Ridge Alpha: 0.0001
Best Ridge R-squared: 0.9579
Best Ridge MSE: 4253446.1541
Best Lasso Alpha: 2.559547922699533
Best Lasso R-squared: 0.9525
Best Lasso MSE: 4794915.4110


In [23]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Prepare the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=best_random_state)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Function to evaluate models
def evaluate_model(model, X, y):
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    return scores.mean()

# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf_score = evaluate_model(rf, X_train_scaled, y_train)
rf.fit(X_train_scaled, y_train)
rf_test_score = r2_score(y_test, rf.predict(X_test_scaled))

# Gradient Boosting
gb = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_score = evaluate_model(gb, X_train_scaled, y_train)
gb.fit(X_train_scaled, y_train)
gb_test_score = r2_score(y_test, gb.predict(X_test_scaled))

print("Random Forest CV R-squared:", rf_score)
print("Random Forest Test R-squared:", rf_test_score)
print("Gradient Boosting CV R-squared:", gb_score)
print("Gradient Boosting Test R-squared:", gb_test_score)

# Feature importance for Random Forest
rf_feature_importance = pd.DataFrame({'feature': X.columns, 'importance': rf.feature_importances_})
rf_feature_importance = rf_feature_importance.sort_values('importance', ascending=False).reset_index(drop=True)

print("\nTop 10 features - Random Forest:")
print(rf_feature_importance.head(10))

# Feature importance for Gradient Boosting
gb_feature_importance = pd.DataFrame({'feature': X.columns, 'importance': gb.feature_importances_})
gb_feature_importance = gb_feature_importance.sort_values('importance', ascending=False).reset_index(drop=True)

print("\nTop 10 features - Gradient Boosting:")
print(gb_feature_importance.head(10))

# Check if engineered features are in top 20
engineered_features = ['weight_per_hp', 'size', 'brand_luxury_index']
for feature in engineered_features:
    print(f"\nPosition of '{feature}' in each model:")
    print(f"Random Forest: {rf_feature_importance[rf_feature_importance['feature'] == feature].index[0] + 1}")
    print(f"Gradient Boosting: {gb_feature_importance[gb_feature_importance['feature'] == feature].index[0] + 1}")

Random Forest CV R-squared: 0.9015532889495802
Random Forest Test R-squared: 0.9320110842431509
Gradient Boosting CV R-squared: 0.8990801504660577
Gradient Boosting Test R-squared: 0.9531849310769985

Top 10 features - Random Forest:
              feature  importance
0          enginesize    0.439491
1          curbweight    0.247709
2  brand_luxury_index    0.205592
3          horsepower    0.021249
4          highwaympg    0.016609
5            carwidth    0.010892
6             citympg    0.007065
7           wheelbase    0.006878
8       weight_per_hp    0.006312
9           carheight    0.004733

Top 10 features - Gradient Boosting:
              feature  importance
0  brand_luxury_index    0.333036
1          enginesize    0.317578
2          curbweight    0.216123
3          horsepower    0.045718
4            carwidth    0.022259
5             citympg    0.019964
6       weight_per_hp    0.009875
7           carlength    0.004982
8          doornumber    0.004048
9          hig

# Conclusion

In this notebook, we've conducted a comprehensive analysis of car price prediction using various regression techniques. Here's a summary of our key findings and processes:

1. **Data Exploration and Preprocessing:**
   - We started with exploratory data analysis to understand the distribution of car prices and the relationships between different features.
   - We handled missing values, removed duplicates, and encoded categorical variables.

2. **Feature Engineering:**
   - We created new features such as 'weight_per_hp', 'size', and 'brand_luxury_index' to capture more complex relationships in the data.
   - We also extracted the car brand from the car name and created a luxury index based on average brand prices.

3. **Model Development:**
   - We implemented three regression models: Linear Regression, Ridge Regression, and Lasso Regression.
   - We used StandardScaler to normalize our features and prevent scale-related biases in our models.
   - We employed cross-validation to find the best hyperparameters for Ridge and Lasso models.
   
4. **Model Evaluation:**
   - We evaluated our models using R-squared and Mean Squared Error metrics.
   
This project demonstrates the power of feature engineering and regularization techniques in improving predictive models. It also highlights the importance of thorough exploratory data analysis and the value of comparing multiple modeling approaches.

I hope this notebook provides useful insights into the factors affecting car prices and serves as a solid foundation for further analysis in this domain.