14. In this exercise we will work with the dataset “car_price_dataset.csv”.

a) Build a complete ML workflow where the dependent variable, 𝑦, is price.
In other words, we want to create a model that can predict the price of a
car. A tip is to first build a simple workflow that works; if you have time
afterwards, you can try to improve the results.

b) Build an application with Streamlit where the user can enter various
details about a car and then get a predicted price. You can watch the
following video to learn more about how Streamlit works:
https://www.youtube.com/watch?v=ggDa-RzPP7A&list=PLgzaMbMPEHEx9Als3F3sKKXexWnyEKH45&index=11&t=158s

c) Could what you have done in this exercise be used in practice?
If so, for what?

In [31]:
# a)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score

In [32]:
# Load the dataset; sep must match the delimiter used in the file (";" here)

df = pd.read_csv("data/car_price_dataset.csv", sep=";")
df.head()

Unnamed: 0,Brand,Model,Year,Engine_Size,Fuel_Type,Transmission,Mileage,Doors,Owner_Count,Price
0,Kia,Rio,2020,4.2,Diesel,Manual,289944,3,5,8501
1,Chevrolet,Malibu,2012,2.0,Hybrid,Automatic,5356,2,3,12092
2,Mercedes,GLA,2020,4.2,Diesel,Automatic,231440,4,2,11171
3,Audi,Q5,2023,2.0,Electric,Manual,160971,2,1,11780
4,Volkswagen,Golf,2003,2.6,Hybrid,Semi-Automatic,286618,3,3,2867


In [33]:
# Define features (X) and the target variable (y)

X = df.drop(["Price"], axis=1)
y = df["Price"]

In [34]:
# Group the columns into categorical and numerical features for preprocessing

categorical_cols = ["Brand", "Model", "Fuel_Type", "Transmission"]
numerical_cols = ["Year", "Engine_Size", "Mileage", "Doors", "Owner_Count"]


In [35]:
# Define preprocessing for numerical features: impute missing values with median
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

# Define preprocessing for categorical features: impute with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Combine numerical and categorical transformers into a single ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])
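
The handle_unknown="ignore" setting matters once the model meets a category it never saw during training, for example a new value typed into the app. A minimal sketch of that behaviour on made-up toy data (the "Hydrogen" value and the variable names are assumptions for illustration, not taken from car_price_dataset.csv):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only
train = pd.DataFrame({"Fuel_Type": ["Diesel", "Hybrid", "Electric"]})
new = pd.DataFrame({"Fuel_Type": ["Diesel", "Hydrogen"]})  # "Hydrogen" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The encoder returns a sparse matrix by default; convert it for printing
encoded = encoder.transform(new).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the unseen category becomes a row of zeros
```

With the default handle_unknown="error", the same transform call would raise an error instead of encoding the unseen value as all zeros.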

In [36]:
# Build a full pipeline that includes preprocessing and a Random Forest model
model_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])

In [37]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
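
As a quick illustration of what test_size=0.2 and random_state=42 do, the sketch below splits a stand-in array of 10000 rows (the row count reported by describe() for Price). The arrays are placeholders, not the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset
X_demo = np.arange(10000).reshape(-1, 1)
y_demo = np.arange(10000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 80% of the rows for training, 20% for testing

# The same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print((X_te == X_te_again).all())
```

So the metrics in this notebook are computed on 2000 held-out cars, and rerunning the cell gives the same split each time.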

In [38]:
# Train the pipeline on the training data
model_pipeline.fit(X_train, y_train)

In [39]:
# Use the trained pipeline to predict car prices on the test data
y_pred = model_pipeline.predict(X_test)

In [40]:
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"MSE: {mse}")
print(f"R2: {r2}")

MAE: 260.725625
RMSE: 334.7633596930076
MSE: 112066.50699295
R2: 0.9878026989626942


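The model fits the test set well: an R2 of about 0.988 means it explains nearly 99% of the variance in price, and the MAE of about 261 is the average absolute error, in the same units as Price.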

In [41]:
# Summary statistics for Price, to put the error metrics in context
print(df["Price"].describe())

count    10000.00000
mean      8852.96440
std       3112.59681
min       2000.00000
25%       6646.00000
50%       8858.50000
75%      11086.50000
max      18301.00000
Name: Price, dtype: float64
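
To put the error metrics in context, they can be related to the price statistics above. A minimal sketch, with the values copied by hand from the outputs in this notebook. Note that the mean and standard deviation come from the full dataset while the errors were measured on the test split, so the ratios are only approximate:

```python
import math

# Values copied from the evaluation cell and from df["Price"].describe()
mae = 260.725625
rmse = 334.7633596930076
mse = 112066.50699295
price_mean = 8852.96440
price_std = 3112.59681

# RMSE is the square root of MSE, so the two metrics carry the same information
rmse_from_mse = math.sqrt(mse)

# Typical error as a share of the average price
mae_share_of_mean = mae / price_mean

# RMSE relative to the spread of prices; well below 1 means the model
# does much better than always predicting the mean price
rmse_share_of_std = rmse / price_std

print(f"sqrt(MSE): {rmse_from_mse:.4f}")
print(f"MAE / mean price: {mae_share_of_mean:.1%}")
print(f"RMSE / std of price: {rmse_share_of_std:.3f}")
```

The MAE is roughly 3% of the average price, and the RMSE is roughly a tenth of the standard deviation of Price, which agrees with the high R2.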


In [45]:
# b)
import joblib

# Save the trained pipeline
joblib.dump(model_pipeline, "models/car_price_model.pkl")

# The app itself continues in streamlit_app.py

['models/car_price_model.pkl']
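
The Streamlit app has to load the saved pipeline and pass it one car at a time. The sketch below shows only that round trip, without any Streamlit code: it trains a small stand-in pipeline on made-up toy data (not car_price_dataset.csv), saves it with joblib, reloads it, and predicts for a single-row DataFrame. The toy columns, the values and the file name toy_car_price_model.pkl are assumptions for illustration; the real app would load models/car_price_model.pkl and build the row from the user's input.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data, made up for illustration only
toy = pd.DataFrame({
    "Brand": ["Kia", "Audi", "Kia", "Audi"],
    "Year": [2010, 2015, 2020, 2022],
    "Mileage": [200000, 150000, 50000, 10000],
    "Price": [3000, 7000, 9000, 15000],
})

# Small stand-in for the notebook's pipeline
pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", "passthrough", ["Year", "Mileage"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
])
pipeline.fit(toy.drop(columns=["Price"]), toy["Price"])

# Save the fitted pipeline and load it back, as the app would do at startup
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_car_price_model.pkl")
    joblib.dump(pipeline, model_path)
    loaded = joblib.load(model_path)

# One car as a single-row DataFrame with the column names used in training
new_car = pd.DataFrame([{"Brand": "Kia", "Year": 2018, "Mileage": 80000}])
predicted_price = loaded.predict(new_car)[0]
print(f"Predicted price: {predicted_price:.0f}")
```

Because the preprocessing lives inside the saved pipeline, the app only needs to supply raw values under the same column names; imputation and one-hot encoding are applied automatically at predict time.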

c) Yes. It could be used as a price estimator, for example to give an owner or a buyer a good sense of what a car is worth.