What drives the price of a car?

Business Understanding:

Business Perspective: Our mission is to identify the crucial factors that influence the pricing of used cars. In the context of the CRISP-DM framework, we must translate this business challenge into a precise data task.

Data Task Definition: Our primary objective is to leverage data to address the following questions:

    Preference for Car Origin: Determine whether customers exhibit a preference for German or Japanese cars when purchasing used vehicles.
    Color Preference: Analyze whether customers have a preference for black or grey cars.
    Regional Spending Patterns: Identify whether customers in specific regions are inclined to spend more on used cars.
    Seasonal Spending: Determine the time of the year when customers tend to spend more on used cars.
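The first two questions reduce to group-wise price comparisons. The following is a minimal sketch, assuming a hand-written `ORIGIN` mapping from manufacturer to country (hypothetical; the dataset has no origin column) and a small made-up frame `toy` in place of the real listings. Listing counts describe what is offered for sale, so they are only a proxy for customer preference.

```python
import pandas as pd

# Hypothetical mapping; the dataset has no origin column, so this is an assumption.
ORIGIN = {
    "bmw": "German", "audi": "German", "volkswagen": "German",
    "toyota": "Japanese", "honda": "Japanese", "nissan": "Japanese",
}

# Toy listings standing in for the real data (values are made up).
toy = pd.DataFrame({
    "manufacturer": ["bmw", "toyota", "honda", "audi", "ford", "nissan"],
    "paint_color": ["black", "grey", "black", "grey", "black", "grey"],
    "price": [15000, 12000, 9000, 17000, 8000, 7000],
})

# Manufacturers missing from the mapping become NaN and drop out of the groupby.
toy["origin"] = toy["manufacturer"].map(ORIGIN)

# Listing count and median asking price per origin and per colour.
by_origin = toy.groupby("origin")["price"].agg(["count", "median"])
by_color = toy.groupby("paint_color")["price"].agg(["count", "median"])
print(by_origin)
print(by_color)
```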

In summary, our data-driven tasks revolve around:

    Optimizing Inventory: Making informed decisions about the types of cars to stock.
    Competitive Pricing: Setting competitive prices based on regional insights.
    Targeted Marketing: Running targeted marketing campaigns during sales downturns.
    Customer Behavior Modeling: Targeting potential buyers through data-driven models.

Business Goals and KPIs:

    Gain insights into customer buying behavior.
    Achieve a 10% year-over-year increase in profit.
    Reduce annual marketing costs by 10%.

Data Understanding:

After considering the business perspective, it's essential to acquaint ourselves with the dataset and identify potential data quality issues. Here are the steps we'll take to accomplish this:

    Dataset Overview: Our dataset comprises 426,880 rows and 18 columns.

    User Attributes: These include unique customer identifiers, city names, and state of car sale.

    Sales Price Attributes: This category features the price of used cars.

    Car Attributes: This group encompasses various car-related information, such as the car's manufacturing year, manufacturer, model, condition, cylinders, fuel type, odometer reading, title status, transmission type, drive, size, type, and paint color.

Data Quality Assessment:

    The data in the "Car attributes" category contains missing values that need interpretation.
    Specifically, there are missing values in columns such as "year," "manufacturer," "model," "condition," "cylinders," "fuel," "odometer," "title_status," "transmission," "VIN," "drive," "size," "type," and "paint_color."
    The "VIN" column is unnecessary for interpreting car sales and can be safely dropped.

Missing-value counts per column:

    year 1205
    manufacturer 17646
    model 5277
    condition 174104
    cylinders 177678
    fuel 3013
    odometer 4400
    title_status 8242
    transmission 2556
    VIN 161042
    drive 130567
    size 306361
    type 92858
    paint_color 130203
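Counts like these are easier to compare as shares of the total; for example, "size" is missing in 306,361 of 426,880 rows, roughly 72%. The sketch below shows one way such a summary could be produced. It uses a small made-up frame `toy` rather than the real listings, and the column names `n_missing` and `pct_missing` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings (values are made up).
toy = pd.DataFrame({
    "year": [2010, np.nan, 2015, 2018],
    "condition": ["good", None, None, "fair"],
    "size": [None, None, None, "compact"],
    "price": [5000, 7000, 9000, 11000],
})

# Count and share of missing values per column, most incomplete first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)
print(missing)
```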


In [None]:
%matplotlib inline
import pickle
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.feature_selection import RFE
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import (BayesianRidge, Lasso, LassoCV,
                                  LinearRegression, Ridge, RidgeCV)
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, r2_score)
from sklearn.model_selection import (GridSearchCV, KFold, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from tqdm import tqdm

warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("C:/Users/fabia/OneDrive/Desktop/vehicles.csv")

In [None]:
df.head(15)

In [None]:
df.info()
df.describe()

In [None]:
print("There are " + str(df.shape[0]) + " rows and " + str(df.shape[1]) + " columns in Dataset!")

In [None]:
px.histogram(df.nsmallest(n=426000, columns=['price']), x="price", nbins=20, title="Price histogram")

In [None]:
plt.hist(df['year'], bins=20, edgecolor='k')
plt.title("Number of cars categorized on Year")
plt.xlabel("Year")
plt.ylabel("Frequency")

# Show the plot
plt.show()

In [None]:
px.histogram(df, x="manufacturer", nbins=20, title="Number of cars categorized on Manufacturer")

In [None]:
top_50_models = df['model'].value_counts()[:50]

# Create a bar plot
plt.figure(figsize=(12, 6))
plt.bar(top_50_models.index, top_50_models.values, edgecolor='k')
plt.title("Number of cars categorized on Model (Top 50 models)")
plt.xlabel("Model")
plt.ylabel("Frequency")
plt.xticks(rotation=90)
plt.show()

In [None]:
# Create a bar plot for car condition
plt.figure(figsize=(10, 6))
df['condition'].value_counts().plot(kind='bar', edgecolor='k')
plt.title("Number of cars categorized on Car Condition")
plt.xlabel("Car Condition")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
df['cylinders'].value_counts().plot(kind='bar', edgecolor='k')
plt.title("Number of Cars Categorized on Cylinders")
plt.xlabel("Cylinders")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
df['fuel'].value_counts().plot(kind='bar', edgecolor='k')
plt.title("Number of Cars Categorized on Fuel Type")
plt.xlabel("Fuel Type")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(df['odometer'], bins=500, edgecolor='k')
plt.title("Odometer Histogram")
plt.xlabel("Odometer Value")
plt.ylabel("Frequency")
plt.show()

In [None]:
px.histogram(df, x="title_status", title="Number of cars categorized on Title status")

In [None]:
px.histogram(df, x="transmission", title="Number of cars categorized on Transmission")

In [None]:
px.histogram(df, x="drive", title="Number of cars categorized on Drive type")

In [None]:
px.histogram(df, x="size", title="Number of cars categorized on Size")

In [None]:
px.histogram(df, x="type", title="Number of cars categorized on Type")

In [None]:
px.histogram(df, x="paint_color", title="Number of cars categorized on Color")


Observations

    Many categorical columns contain NaN values. We need to either fill them in or remove the affected rows altogether.
    Some columns contain outliers that need to be handled.
    The target column 'price' has outliers and is skewed as well; both need fixing.
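Skew can be quantified rather than judged by eye. The sketch below uses synthetic log-normal prices (not drawn from the real dataset) to show how sample skewness changes once a log transform is applied; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (log-normal); not drawn from the real dataset.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=9.5, sigma=0.8, size=5000))

# Sample skewness before and after a log transform.
skew_raw = price.skew()
skew_log = np.log(price).skew()
print(f"skew raw={skew_raw:.2f}, skew log={skew_log:.2f}")
```

A skewness near zero after the transform is what motivates modeling log(price) instead of price.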




Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling. Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with sklearn.


In [None]:
cars = df.reindex(columns=[
    'id', 'region', 'year',
    'manufacturer', 'model', 'condition',
    'cylinders', 'fuel', 'odometer',
    'title_status', 'transmission',
    'VIN', 'drive', 'size',
    'type', 'paint_color', 'state', 'price'])

In [None]:
cars.head()

In [None]:
print("Number of NaN values in each column:")
cars.isnull().sum()

In [None]:
print("Dropping id and VIN since they do not affect car prices!")
print("Dropping state and region as they do not affect "
      "car prices much when there is demand!")
cars = cars.drop(columns=['id', 'VIN', 'state', 'region'])

In [None]:
fig = px.imshow(cars.isnull())
fig.update_layout(
    title = "Heatmap showing Nan values in each column")
fig.update_layout(barmode='group', bargap=0.30,bargroupgap=0.0)
fig.show()

In [None]:
num_features = [
    'year',
    'odometer'
]
cat_features = [
    'manufacturer',
    'model',
    'condition',
    'cylinders',
    'fuel',
    'title_status',
    'transmission',
    'drive',
    'size',
    'type',
    'paint_color'
]
print(f"These are numerical features in dataset: {num_features}")
print(f"These are categorical features in dataset: {cat_features}")

In [None]:
cars_imputer = cars.copy()

encoder = preprocessing.LabelEncoder()

def encode(data):
    """Label-encode the non-null values of a Series in place and return it."""
    nonulls = np.array(data.dropna())
    # LabelEncoder expects a 1-D array of labels, so no reshape is needed
    impute_ordinal = encoder.fit_transform(nonulls)
    data.loc[data.notnull()] = impute_ordinal
    return data

for col in tqdm(cat_features):
    # Assign the result back instead of relying on in-place mutation of a column
    cars_imputer[col] = encode(cars_imputer[col])

In [None]:
estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(
        max_features='sqrt',
        random_state=0
    ),
    ExtraTreesRegressor(
        n_estimators=10,
        random_state=0
    ),
    KNeighborsRegressor(
        n_neighbors=15
    )
]
score = pd.DataFrame()
for estimator in estimators:
    print(f"Estimating using {estimator.__class__.__name__} estimator!")
    imputer = IterativeImputer(estimator)
    cars_impute = cars_imputer.copy()
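    for col in cars_imputer.columns:
        # Each column is imputed on its own; with a single feature IterativeImputer
        # has no other columns to regress on and returns its initial (mean) fill.
        impute_data=imputer.fit_transform(
            cars_impute[col].values.reshape(-1,1)
        )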
        impute_data=impute_data.astype('int64')
        impute_data = pd.DataFrame(
            np.ravel(impute_data)
        )
        cars_impute[col]=impute_data
    X = cars_impute.iloc[:,:-1]
    y = np.ravel(cars_impute.iloc[:,-1:])
    score[estimator.__class__.__name__] = cross_val_score(
        estimator,
        X,
        y,
        scoring='neg_mean_squared_error',
        cv=6
    )
del cars_imputer

In [None]:
# Negative MSE scores of each estimator for cv=6 (scoring='neg_mean_squared_error')
score

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
means = -score.mean()
errors = score.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title('MSE with Different Imputation Methods')
ax.set_xlabel('MSE')
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels(means.index.tolist())
plt.tight_layout(pad=1)
plt.show()


The figure above shows that BayesianRidge achieves the lowest cross-validated MSE of the four estimators, so we use it for imputation.
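IterativeImputer models each feature with missing values as a function of the other features, so it needs several columns at once to do more than a simple fill. The sketch below illustrates this on synthetic data (two correlated columns, invented for the example) and compares the result with a plain column-mean fill. It is an illustration under these assumptions, not the procedure used in this notebook.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data: the second column is roughly twice the first.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
X = np.column_stack([x, 2 * x + rng.normal(0, 0.1, size=200)])

# Hide 20 values of the second column.
X_missing = X.copy()
hidden = rng.choice(200, size=20, replace=False)
X_missing[hidden, 1] = np.nan

# Fit on both columns at once so the estimator can use column 0 to predict column 1.
imputer = IterativeImputer(BayesianRidge(), random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Mean absolute error of the imputed values, versus a plain column-mean fill.
mae_iterative = np.abs(X_filled[hidden, 1] - X[hidden, 1]).mean()
mae_mean = np.abs(np.nanmean(X_missing[:, 1]) - X[hidden, 1]).mean()
print(mae_iterative, mae_mean)
```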

In [None]:
# Nan values in Numerical features
cars.isnull().sum()[num_features]

In [None]:
cars_num = cars[num_features]

# Using estimators[0] = BayesianRidge to fill Nan values in Numerical features.
imputer_num = IterativeImputer(estimators[0])
impute_data = imputer_num.fit_transform(cars_num)
cars[num_features] = impute_data

In [None]:
# Missing values after filling
cars.isnull().sum()[num_features]

# Nan values in Categorical features
cars.isnull().sum()[cat_features]

# Using BayesianRidge imputer for categorical columns as well.
cars_cat = cars[cat_features].copy()  # work on a copy to avoid chained-assignment issues
encoder = preprocessing.LabelEncoder()

for column in cat_features:
    # encode() refits the shared encoder, so inverse_transform below uses this column's labels
    cars_cat[column] = encode(cars_cat[column])
    imputer = IterativeImputer(BayesianRidge())
    impute_data = imputer.fit_transform(cars_cat[column].values.reshape(-1, 1))
    # Truncate the imputed codes to integers and map them back to category labels (1-D input)
    impute_data = impute_data.astype('int64').ravel()
    cars_cat[column] = encoder.inverse_transform(impute_data)
cars[cat_features] = cars_cat

cars.isnull().sum()[cat_features]

fig = px.imshow(cars.isnull())
fig.update_layout(
    title = "Heatmap showing all Nan values are eliminated!")
fig.update_layout(barmode='group', bargap=0.30,bargroupgap=0.0)
fig.show()

In [None]:
def outliers_range(arr: pd.DataFrame, col: str) -> tuple:
    """
    Function to find the outlier fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    for the given DataFrame and column
    """
    # Use pandas' interpolated quantiles, as min_max_price() does
    q1, q3 = arr[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    x_values_1 = q1 - 1.5 * iqr
    x_values_2 = q3 + 1.5 * iqr
    return (x_values_1, x_values_2)

In [None]:
def min_max_price(df: pd.DataFrame) -> tuple:
    """
    Function to find min and max price to remove outliers
    """
    range_ = []
    q1, q3 = (df['logprice'].quantile([0.25,0.75]))
    range_.append(q1 - 1.5 * (q3 - q1))
    range_.append(q3 + 1.5 * (q3 - q1))
    return (range_)

# Adding logprice since the price column is skewed; the log transform brings its distribution closer to normal.
# Note: listings with price 0 become -inf here and are filtered out later.
cars['logprice'] = np.log(cars['price'])
x = cars['logprice']
price_range = list(range(0, int(max(cars['logprice'])) + 1))
red_square = dict(markerfacecolor='g', marker='s')
plt.boxplot(x, vert=False)
plt.xticks(price_range)
plt.text(min_max_price(cars)[0]-0.3,1.05,str(round(min_max_price(cars)[0],2)))
plt.text(min_max_price(cars)[1]-0.5,1.05,str(round(min_max_price(cars)[1],2)))
plt.title("Box Plot of Price")
plt.show()

The box plot above shows that log-prices below 6.43 and above 12.44 (roughly $620 and $253,000) are outliers.

In [None]:
fig, ax1 = plt.subplots()
ax1.set_title('Figure 2: Box Plot of Odometer')
ax1.boxplot(cars['odometer'], vert=False, flierprops=red_square)
plt.show()

The box plot above shows that odometer readings above 282,235 are outliers. The lower fence of -107,725 is negative, so no low-end readings are flagged.

In [None]:
fig,(ax1,ax2)=plt.subplots(ncols=2,figsize=(12,5))

# plotting boxplot
o1, o2 = outliers_range(cars,'year')
ax1.boxplot(sorted(cars['year']), vert=False, flierprops=red_square)
ax1.set_xlabel("Years")
ax1.set_title("Figure 3: Box Plot of Year")
ax1.text(o1-8,1.05,str(round(o1,2)))

# plotting histogram
hist,bins=np.histogram(cars['year'])
n, bins, patches = ax2.hist(x=cars['year'], bins=bins)
ax2.set_xlabel("Years")
ax2.set_title("Figure 4: Histogram of Year")
for i in range(len(n)):
    if(n[i]>2000):
        ax2.text(bins[i],n[i]+3000,str(n[i]))

plt.tight_layout()
plt.show()


The box plot above shows that model years before 1995 and after 2022 are outliers.
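The outlier bounds quoted above are Tukey fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The following worked example computes them for a handful of made-up odometer readings; the helper name `iqr_fences` is illustrative and not part of this notebook. It also shows why a lower fence can come out negative for a quantity that cannot be.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences for a 1-D array, ignoring NaN."""
    q1, q3 = np.nanquantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up odometer readings with one implausible value.
odometer = np.array([20_000, 45_000, 60_000, 80_000,
                     95_000, 120_000, 150_000, 900_000])
low, high = iqr_fences(odometer)

# Keep only readings inside the fences.
kept = odometer[(odometer >= low) & (odometer <= high)]
print(low, high, kept)
```

Here Q1 = 56,250 and Q3 = 127,500, so the fences are -50,625 and 234,375, and only the 900,000 reading is dropped.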

In [None]:
# Removing outliers using the outliers_range() function on the logprice, odometer and year columns
cars_new = cars.copy()
out = np.array([
    'logprice',
    'odometer',
    'year'
])
for col in out:
    o1,o2 = outliers_range(cars_new, col)
    cars_new = cars_new[(cars_new[col]>=o1) & (cars_new[col]<=o2)]
    print('Outlier fences for',col,'=',o1,o2)
cars_new = cars_new[cars_new['price']!=0]
cars_new.drop('logprice',axis=1,inplace=True)

In [None]:
print(f"Shape before process={cars.shape}")
print(f"Shape After process={cars_new.shape}")
print(
    f"Total {cars.shape[0]-cars_new.shape[0]} rows "
    f"and {cars.shape[1]-cars_new.shape[1]} columns were removed")
cars_new.to_csv("vehicles_finalized.csv",index=False)

cars_new.head()

Summarizing the Data Cleanup Steps:

Column Removal:

    We removed the "id" and "VIN" columns since they don't provide any valuable information for price prediction. The "state" and "region" columns were dropped as well, on the assumption that location has little effect on price when there is demand.

Handling Missing Values:

    To address missing values, we compared several regression estimators (BayesianRidge, DecisionTreeRegressor, ExtraTreesRegressor, and KNeighborsRegressor) using cross-validation. BayesianRidge yielded the lowest Mean Squared Error (MSE), so we used it with IterativeImputer to fill missing values in both the numerical and the categorical columns.

Outlier Detection and Removal:

    Outliers were identified in the "Price," "Odometer," and "Year" columns using the Interquartile Range (IQR) method.
    A total of 62,427 rows were removed during the outlier removal process for these columns.

In [None]:
cars_cleaned = cars_new.copy()
cars_cleaned['year'] = cars_cleaned['year'].astype('int64')

In [None]:
cars_cleaned.shape

In [None]:
cars_cleaned.columns

cars_sample = cars_cleaned.sample(1000)
cars_sample.shape

In [None]:
# Plotting a pairplot to view distribution of numerical features.
colors = iter([
    'xkcd:red purple', 'xkcd:pale teal', 'xkcd:warm purple',
    'xkcd:light forest green', 'xkcd:blue with a hint of purple',
    'xkcd:light peach', 'xkcd:dusky purple', 'xkcd:pale mauve',
    'xkcd:bright sky blue'])

def my_scatter(x,y, **kwargs):
    kwargs['color'] = next(colors)
    plt.scatter(x,y, **kwargs)

def my_hist(x, **kwargs):
    kwargs['color'] = next(colors)
    plt.hist(x, **kwargs)

g = sns.PairGrid(cars_sample)
g.map_diag(my_hist)
g.map_offdiag(my_scatter)

In [None]:
fig = px.histogram(cars_cleaned, x="price", nbins=20, title="Price histogram")
fig.show()

In [None]:
def barplot_generator(df, x, y, title='', hue=None):
    """
    Function which takes df, x, y, title and hue as input
    and generates a bar plot using seaborn.barplot.
    """
    fig, axis = plt.subplots(figsize=(10, 6))
    # With no hue a plain bar plot is drawn; a column name splits each bar by that column
    sns.barplot(x=x, y=y, data=df, ax=axis, hue=hue or None)
    axis.set_title(title)
    plt.xticks(rotation=45)
    plt.show()

In [None]:
barplot_generator(cars_cleaned, 'fuel', 'price', 'Car price by Fuel Type')


Hybrid cars have lower prices, and diesel cars cost more on average than electric ones.

In [None]:
barplot_generator(cars_cleaned, 'fuel', 'price', 'Car price by Fuel Type with condition as hue', hue='condition')


Irrespective of fuel type, car condition determines price. Salvage cars sit at a lower price point.

In [None]:
barplot_generator(cars_cleaned, 'year', 'price', 'Car price by Year')


Average prices rise steadily with model year from 2000 onward.

In [None]:
barplot_generator(cars_cleaned, 'condition', 'price', 'Car price by Condition', hue='size')

In [None]:
barplot_generator(cars_cleaned, 'transmission', 'price', 'Car price by Transmission')


The two plots above clearly show that car condition drives price. Car size affects price as well.

In [None]:
barplot_generator(cars_cleaned, 'type', 'price', 'Car price by Type')


In the transmission plot above, manual cars have low prices; the other transmission types are priced higher.

In [None]:
barplot_generator(cars_cleaned, 'manufacturer', 'price', 'Car price by manufacturer')

In [None]:
barplot_generator(cars_cleaned, 'size', 'price', 'Car price by Size')

Moving forward with modeling:

Now that we have our nearly finalized dataset, it's time to construct a variety of regression models with the target variable being "price." During this modeling phase, we will consider different model types and explore various parameters. Additionally, we will conduct cross-validation to validate our model findings and ensure their robustness.
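One way to keep cross-validation honest is to put the preprocessing inside the model pipeline, so each fold fits its scaler and encoder on the training part only. The sketch below shows this on synthetic listings with an invented price formula. Note that it swaps in OneHotEncoder for the LabelEncoder used in this notebook, because integer codes impose an arbitrary order on categories that a linear model reads as a numeric trend; the Ridge model and its alpha are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic listings; the price formula is invented for illustration only.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "year": rng.integers(1995, 2023, size=n),
    "odometer": rng.uniform(0, 250_000, size=n),
    "fuel": rng.choice(["gas", "diesel", "hybrid"], size=n),
})
log_price = (8.5 + 0.05 * (toy["year"] - 1995) - 4e-6 * toy["odometer"]
             + 0.3 * (toy["fuel"] == "diesel") + rng.normal(0, 0.1, size=n))

# Scale numeric columns and one-hot encode categoricals inside the pipeline,
# so each CV fold fits its preprocessing on the training part only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["year", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])
model = Pipeline([("prep", preprocess), ("reg", Ridge(alpha=1.0))])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, toy, log_price, scoring="r2", cv=cv)
print(scores.mean())
```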


In [None]:
num_features = ['year','odometer']
cat_features = [
    'manufacturer',
    'model',
    'condition',
    'cylinders',
    'fuel',
    'title_status',
    'transmission',
    'drive',
    'size',
    'type',
    'paint_color'
]

In [None]:
label_encoder = preprocessing.LabelEncoder()
cars_cleaned[cat_features] = cars_cleaned[cat_features].apply(
    label_encoder.fit_transform)

In [None]:
cars_cleaned

In [None]:
# Log-transform the target (the price column is right-skewed)
norm = StandardScaler()
cars_cleaned['price'] = np.log(cars_cleaned['price'])
# Standardize the numerical features; the 'model' label codes are standardized as well
cars_cleaned['odometer'] = norm.fit_transform(np.array(cars_cleaned['odometer']).reshape(-1,1))
cars_cleaned['year'] = norm.fit_transform(np.array(cars_cleaned['year']).reshape(-1,1))
cars_cleaned['model'] = norm.fit_transform(np.array(cars_cleaned['model']).reshape(-1,1))

# Remove outliers from the log-price target using the 1.5 * IQR fences
q1, q3 = (cars_cleaned['price'].quantile([0.25,0.75]))
o1 = q1-1.5*(q3-q1)
o2 = q3+1.5*(q3-q1)
cars_cleaned = cars_cleaned[(cars_cleaned.price>=o1) & (cars_cleaned.price<=o2)]

cars_cleaned.head()



In [None]:
def split_dataset(df, n):
    """
    Function to split training and test dataset
    """
    X = df.iloc[:,n]
    y = df.iloc[:,-1:].values.T
    y = y[0]
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        train_size=0.9,
        test_size=0.1,
        random_state=0
    )
    return (X_train,X_test,y_train,y_test)

X_train, X_test, y_train, y_test = split_dataset(
    cars_cleaned,
    list(range(len(list(cars_cleaned.columns))-1))
)

In [None]:
def remove_neg(y_test, y_pred):
    """
    Function to keep only the samples whose prediction is positive,
    since MSLE cannot be computed on negative values.
    """
    mask = y_pred > 0
    return (y_test[mask], y_pred[mask])

def evaluate(y_test, y_pred):
    """
    Function to evaluate the model
    """
    result = []
    result.append(mean_squared_log_error(y_test, y_pred))
    result.append(np.sqrt(result[0]))
    result.append(r2_score(y_test,y_pred))
    result.append(round(r2_score(y_test,y_pred)*100,4))
    return (result)

# Dataframe to store the performance of each model
# MSLE is the error metric here; the price target has already been log-transformed.
# 'Accuracy(%)' is the R2 score expressed as a percentage.
accuracy = pd.DataFrame(index=['MSLE', 'Root MSLE', 'R2 Score','Accuracy(%)'])
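scikit-learn's `mean_squared_log_error` takes log(1 + y) of both arguments internally. When the target is already log(price), it is therefore logged a second time, and plain MSE on the log target is the quantity that corresponds to MSLE on dollar prices. The sketch below checks that identity on made-up prices; it illustrates the metric and does not change how the notebook evaluates its models.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up prices in dollars.
y_true = np.array([5000.0, 12000.0, 30000.0])
y_pred = np.array([6000.0, 11000.0, 25000.0])

# MSLE on raw prices equals MSE on log1p-transformed prices.
msle = mean_squared_log_error(y_true, y_pred)
mse_of_logs = mean_squared_error(np.log1p(y_true), np.log1p(y_pred))
print(msle, mse_of_logs)
```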


Linear regression with RFE

In [None]:
# Create a pipeline
pipeline = Pipeline([
    ('feature_selection', RFE(LinearRegression())),
    ('model', LinearRegression())
])

# Define hyperparameters
hyper_params = {
    'feature_selection__n_features_to_select': list(range(1, 14))
}

# Create KFold cross-validator
folds = KFold(n_splits=5, shuffle=True, random_state=100)

# Perform GridSearchCV
model_cv = GridSearchCV(
    estimator=pipeline,
    param_grid=hyper_params,
    scoring='r2',
    cv=folds,
    verbose=1,
    return_train_score=True
)

# Fit the model
model_cv.fit(X_train, y_train)

# Get CV results
cv_results = pd.DataFrame(model_cv.cv_results_)
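After a grid search over the number of RFE features, the selected count and the retained features can be read from the fitted search object. The sketch below shows this on a synthetic regression problem from `make_regression`; the data, the feature count, and the variable names `search`, `best_n`, and `kept_mask` are assumptions for the example and do not reproduce the results of this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic regression problem: 6 features, 3 of them informative.
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=5.0, random_state=0)

pipeline = Pipeline([
    ("feature_selection", RFE(LinearRegression())),
    ("model", LinearRegression()),
])
search = GridSearchCV(
    pipeline,
    param_grid={"feature_selection__n_features_to_select": list(range(1, 7))},
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=100),
)
search.fit(X, y)

# Best number of features and the boolean mask of the features RFE kept.
best_n = search.best_params_["feature_selection__n_features_to_select"]
kept_mask = search.best_estimator_.named_steps["feature_selection"].support_
print(best_n, kept_mask, round(search.best_score_, 3))
```

With a DataFrame as input, `X.columns[kept_mask]` would give the names of the retained features.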

In [None]:
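# Create a list of alpha values to test
alphas = 10**np.linspace(10, -2, 400)

# Initialize RidgeCV with the alphas
# (scikit-learn 1.5+ renames store_cv_values / cv_values_ to store_cv_results / cv_results_)
ridge_cv = RidgeCV(alphas=alphas, store_cv_values=True)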

# Fit RidgeCV to your data
ridge_cv.fit(X_train, y_train)

# Get the cross-validated mean squared errors for each alpha
cv_mean_errors = np.mean(ridge_cv.cv_values_, axis=0)

# Plot the results
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, cv_mean_errors, '-o', color='b', markersize=5, label='Mean CV MSE')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Mean CV MSE')
plt.title('Alpha Selection for Ridge Regression')
plt.grid(True)
plt.legend()
plt.show()

# Print the best alpha
best_alpha = ridge_cv.alpha_
print(f"Best alpha: {best_alpha}")

In [None]:
# Create a list of alpha values to test
alphas = 10**np.linspace(10, -2, 400)

# Initialize LassoCV with the alphas
lasso_cv = LassoCV(alphas=alphas)

# Fit LassoCV to your data
lasso_cv.fit(X_train, y_train)

# Get the best alpha
best_alpha = lasso_cv.alpha_

# Plot the results
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, lasso_cv.mse_path_, ':')
plt.plot(alphas, lasso_cv.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(best_alpha, linestyle='--', color='k', label='Best alpha')
plt.legend()
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Mean Square Error (MSE)')
plt.title('Alpha Selection for Lasso Regression')
plt.grid(True)
plt.show()

print(f"Best alpha: {best_alpha}")

In [None]:
# Fit a Lasso model (alpha=0.01) and predict on the test set
lasso_model = Lasso(alpha=0.010)
lasso_model.fit(X_train,y_train)
y_pred = lasso_model.predict(X_test)

In [None]:
# calculating error/accuracy

y_test_3, y_pred_3 = remove_neg(
    y_test,
    y_pred
)
r3_lasso = evaluate(y_test_3,y_pred_3)

print(f"MSLE : {r3_lasso[0]}")
print(f"Root MSLE : {r3_lasso[1]}")
print(f"R2 Score : {r3_lasso[2]} or {r3_lasso[3]}%")

accuracy['Lasso Regression'] = r3_lasso

In [None]:
fig = px.scatter(x=y_test, y=y_pred, labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Lasso Model: Used Car Prediction with Log price")
fig.show()

In [None]:
fig = px.scatter(x=np.exp(y_test), y=np.exp(y_pred), labels={'x': "Actual Car Price", 'y': "Predicted Car Price"}, title="Lasso Model: Used Car Prediction with Actual price")
fig.show()
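Because the model predicts log(price), an error in dollars is often easier to communicate to a dealership than an error on the log scale. The sketch below back-transforms made-up log-price values with `np.exp` and reports the mean absolute error; the numbers are invented and are not outputs of the Lasso model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up log-price targets and predictions.
y_test_log = np.log(np.array([5000.0, 12000.0, 30000.0]))
y_pred_log = np.log(np.array([5500.0, 11000.0, 27000.0]))

# Undo the log transform, then measure the error in dollars.
y_test_price = np.exp(y_test_log)
y_pred_price = np.exp(y_pred_log)
mae_dollars = mean_absolute_error(y_test_price, y_pred_price)
print(round(mae_dollars, 2))
```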

In [None]:
fig = px.bar(x=X_train.columns, y=lasso_model.coef_, title="Lasso Model Coefs")
fig.show()

Recommendations to Car Dealership:

Based on our analysis, here are some key recommendations that car dealerships can use to optimize their used car inventory and drive sales while ensuring customer satisfaction:

    Prioritize Year and Odometer: Consumers highly value the year of manufacture and odometer reading, which significantly influence the price range of a car. Focus on offering cars with favorable year and mileage attributes to attract more buyers.

    Consider Diesel and Electric Cars: Diesel and electric vehicles tend to command higher prices compared to gasoline cars. Expanding the inventory with these options can help increase overall revenue.

    Emphasize High Cylinder Counts: Cars with more cylinders tend to have higher price points. Consider offering cars with higher cylinder counts to cater to customers looking for performance-oriented vehicles.

    Monitor Title Status and Condition: Be mindful of the title status and condition of the cars in your inventory. Salvaged cars can significantly reduce prices, so it's essential to properly assess and price them accordingly.

    Diversify Transmission Types: Different transmission types impact car prices differently. Automatic and other transmission types typically have higher price points, while manual transmissions tend to lower the car's price. Diversify your inventory to cater to various preferences.

def predict_car_price(year, odometer, manufacturer, condition, cylinders, fuel, transmission, drive, size, type, paint_color, model):
    """
    Predicts the price of a used car.

    Args:
        year (int): Year of the car (1995 to 2022).
        odometer (int): Mileage of the car (integer greater than 0).
        manufacturer (str): Car manufacturer.
        condition (str): Car condition.
        cylinders (str): Number of cylinders in the car.
        fuel (str): Type of fuel used.
        transmission (str): Type of transmission.
        drive (str): Drive type.
        size (str): Car size.
        type (str): Car type.
        paint_color (str): Car paint color.
        model (str): Car model.
    """


Deployment:

Having finalized our models and findings, it's time to convey this information effectively to our client. We will present our work in the form of a concise report that highlights our primary discoveries. It's important to remember that our audience consists of used car dealers who are keen to fine-tune their inventory strategies.

Addressing the Needs of Used Car Dealers:

As a used car dealership, understanding what consumers value in a used car is crucial. Leveraging the data you've provided, we've diligently grouped and analyzed it to offer valuable insights aimed at enhancing customer conversion rates. Our analysis has illuminated key factors that predict customer interest in purchasing a car, including:

    The car's manufacturing year.
    The car's size.
    The car's condition.

Our findings underscore that a recently manufactured car in good condition tends to outperform older, poorly maintained vehicles in terms of sales.

Future Endeavors:

While this analysis provides a solid foundation for comprehending customer behavior, it hasn't revealed any groundbreaking insights. To delve deeper into this realm, we would like to conduct further analyses. This entails addressing data gaps, such as:

    Obtaining purchase and selling prices for cars.
    Identifying the manufacturing year and the year in which the car was sold.
    Rectifying missing or incorrect values, such as rows with both odometer and price values set to 0, clarifying if an odometer value of 0 signifies a new car, and filling missing data for model and manufacturer using a VIN database if available.
    Investigating the root causes of outliers within the data.

Conclusion:

At this juncture, it's evident that additional data is necessary to provide a definitive recommendation. While our analysis has been insightful, it remains inconclusive. We look forward to further exploring the data to provide more comprehensive guidance in the future.