# Subproject 1 – Used Car Prices Prediction
Machine Learning – M.Sc. in Electrical and Computer Engineering


Importing libraries

In [None]:
import pandas as pd

In [None]:
import numpy as np
import matplotlib.pyplot as plt


Reading the dataset files into DataFrames using the pandas library

In [None]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_sample_submission = pd.read_csv("sample_submission.csv")

# Exploratory Data Analysis (EDA)

After reading the datasets into DataFrames, I analyse the data.

Purpose:

This lets me understand what kind of data I am working with.
It shows which features are present in the data and whether they are categorical or numerical.
It also shows whether there are missing values and, if so, in which columns and of which type. This helps me choose a suitable method for filling them.

To do this I use the pandas attributes "shape" and "dtypes" and the methods "head()", "info()" and "describe()".

During this EDA I also explore other related aspects of the data, which helps me make better decisions when training on it.

The shape attribute gives the dimensions (rows x columns) of the data I am working with.

In [None]:
# for train data

df_train.shape

In [None]:
# for test data

df_test.shape

The head() method shows the first rows of the data (five by default). An argument "n" can be passed to show a specific number of rows.

Passing n shows the first n rows, from index 0 to n-1.

In [None]:
# for train data

df_train.head()

In [None]:
# for test data

df_test.head()

The info() method shows a summary of each DataFrame: the index, the columns, their datatypes, the non-null counts and the memory usage.

The non-null count of each column, compared with the total number of rows, shows which columns have missing values.

In [None]:
#  for the train data

df_train.info()

In [None]:
#  for the test data

df_test.info()

Target visualization

In [None]:
# The price distribution plot below shows that the target ['price'] of the train data is right-skewed: most cars are in the lower price range and only a few are high-priced.

# The target distribution was analyzed to identify skewness and outliers in used car prices, which directly affect model choice and potential target transformations.

# Because of this skewness, a log transform of the target can be used: it compresses the high-price tail and makes the errors relative (proportional to the price) rather than absolute.

# price distribution visualization
plt.figure(figsize=(6,4))
plt.hist(df_train["price"])
plt.title("Distribution of Car Prices")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()
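
The log transform mentioned in the comments above can be sketched as follows. This is only an illustration on synthetic, log-normally distributed values, not on the competition data. The array name `prices`, the helper `skewness` and the choice of `np.log1p` / `np.expm1` are assumptions made for this example.

```python
import numpy as np

# Synthetic right-skewed "prices" (assumption: stands in for df_train["price"])
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=10.0, sigma=1.0, size=5000)


def skewness(values):
    """Sample skewness (third standardized moment)."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


# log1p compresses the long right tail of the distribution
log_prices = np.log1p(prices)

# expm1 is the inverse of log1p; it maps values back to the price scale
recovered = np.expm1(log_prices)

print(f"skewness before log transform: {skewness(prices):.2f}")
print(f"skewness after log transform:  {skewness(log_prices):.2f}")
```

If a model is trained on the transformed target, its predictions have to be passed through `np.expm1` before they are compared with real prices.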

Separating numerical and categorical columns:

This part shows which columns are numerical and which are categorical. Doing this helped me understand whether data encoding would be needed.

In [None]:
#  for train data

numerical_columns = df_train.select_dtypes(include=["int64", "float64"]).columns
categorical_columns = df_train.select_dtypes(include=["object", "bool", "category"]).columns
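
As a small illustration of how select_dtypes() splits the columns, here is a sketch on a toy DataFrame. The column names and values are made up for the example. It uses the "number" selector, which is one possible alternative to listing "int64" and "float64" explicitly.

```python
import pandas as pd

# Toy frame with hypothetical columns (not the competition data)
toy = pd.DataFrame({
    "model_year": [2015, 2018, 2020],
    "milage": [51000.0, 34742.5, 9835.0],
    "fuel_type": ["Gasoline", "Hybrid", "Diesel"],
})

# "number" matches every numeric dtype (integers and floats of any width)
numerical = toy.select_dtypes(include="number").columns.tolist()

# everything that is not numeric is treated as categorical in this sketch
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical)
print("categorical:", categorical)
```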


Missing values

The pandas library provides isnull() to flag missing values, sum() to count them per column, and sort_values() to sort the result in ascending or descending order.

In [None]:
# isnull() returns a boolean DataFrame, so .mean() gives the fraction of missing values in each column; multiplying by 100 converts it to a percentage.

#  Missing value analysis was performed to quantify data incompleteness and justify the imputation strategies applied during preprocessing.

#  for train data
train_missing_values =df_train.isnull().sum().sort_values(ascending=False)
train_missing_columns = df_train.columns[df_train.isna().any()].tolist()

train_missing_percentage = df_train.isnull().mean().sort_values(ascending=False) * 100

train_missing_columns
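
The same count and percentage calculation can be seen on a toy DataFrame where the result is easy to check by hand. The data below is made up for illustration, and the names `toy` and `summary` are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Toy frame: fuel_type has 2 of 4 values missing, accident has 1 of 4
toy = pd.DataFrame({
    "fuel_type": ["Gasoline", np.nan, "Diesel", np.nan],
    "accident": ["None reported", "None reported", np.nan, "At least 1"],
    "milage": [51000, 34742, 9835, 120000],
})

missing_mask = toy.isnull()               # True where a value is missing
missing_count = missing_mask.sum()        # number of missing values per column
missing_percent = missing_mask.mean() * 100  # share of missing values in percent

# keep only the columns that actually have missing values
summary = pd.DataFrame({"count": missing_count, "percent": missing_percent})
summary = summary[summary["count"] > 0].sort_values("percent", ascending=False)

print(summary)
```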

In [None]:
#for test data

df_test.isnull().sum().sort_values(ascending=False)

train_missing_percentage[train_missing_percentage > 0]

In [None]:
# the plot for columns with missing data
(df_train.drop(['price'], axis=1).isnull().mean() * 100).plot(kind="bar")
plt.title("Missing values")
plt.ylabel("Missing values (%)")
plt.show()


# Data preprocessing

## Filling missing data

The info() output from the EDA above showed that the train data has missing values in the fuel_type, accident and clean_title features.

These features are categorical, and the missing values need to be filled before training, because models such as linear regression and k-NN cannot be fitted on data containing NaNs.

In [None]:
# To replace the missing values in the training set, I used the pandas mode() function to get the most frequent value of each column and fillna() to fill the positions where values were missing.

for column in train_missing_columns:
    #  calculates the mode of the values.
    replace_value = df_train[column].mode()[0]

    # then filled it to the missing values position.
    df_train[column] = df_train[column].fillna(replace_value)

# For the test set, I filled missing values only in the columns that had missing values in the train set and that also exist in the test set (so 'price' is excluded).
# This keeps the preprocessing of train and test consistent. Since the test set is unseen data, it may also contain missing values, so it is best to handle them here.
 
# this line selects the train columns with missing data that are also present in the test data
missing_columns_test = [col for col in train_missing_columns if col in df_test.columns]

for column in missing_columns_test:
    # Use the mode calculated from the training set for consistency
    replace_value = df_train[column].mode()[0]
    df_test[column] = df_test[column].fillna(replace_value)
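
The idea of learning the fill values from the train set and reusing them on the test set can also be written without a loop. The sketch below uses toy data; the name `train_modes` is an assumption of this example and is not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with missing categorical values (made-up data)
train = pd.DataFrame({
    "fuel_type": ["Gasoline", "Gasoline", np.nan, "Diesel"],
    "accident": ["None reported", np.nan, "None reported", "At least 1"],
})
test = pd.DataFrame({
    "fuel_type": [np.nan, "Diesel"],
    "accident": ["At least 1", np.nan],
})

# Most frequent value of each column, learned from the training data only.
# mode() returns a DataFrame; its first row holds one mode per column.
train_modes = train.mode().iloc[0]

# fillna() with a Series fills each column with the value under its label
train_filled = train.fillna(train_modes)
test_filled = test.fillna(train_modes)

print(test_filled)
```

Using the train modes for both sets means the test data never influences the values used for filling.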


In [None]:
#confirming if the missing data was successfully filled

df_train.info()

In [None]:
df_test.info()

## Encoding Categorical Features

The EDA above showed that the datasets contain categorical features; these need to be encoded as numerical features.

To do this, I used the pandas get_dummies() function, which one-hot encodes categorical data.

In [None]:
df_train_encoded = pd.get_dummies(df_train, drop_first=True)

df_test_encoded = pd.get_dummies(df_test, drop_first=True)
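# align the test columns with the train feature columns ('price' is the target, so it is excluded)
df_test_encoded = df_test_encoded.reindex(columns=df_train_encoded.columns.drop('price'), fill_value=0)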
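
To see what the reindex step does, here is a sketch on toy data where the test set contains a category ("Electric") that never appears in the train set and lacks one that does ("Hybrid"). The data is made up, and `dtype=int` is used only to make the printed output easier to read.

```python
import pandas as pd

# Toy data: "Electric" is unseen in train, "Hybrid" is absent from test
train = pd.DataFrame({
    "fuel_type": ["Gasoline", "Diesel", "Hybrid"],
    "model_year": [2015, 2018, 2020],
})
test = pd.DataFrame({
    "fuel_type": ["Gasoline", "Electric"],
    "model_year": [2019, 2021],
})

# One-hot encode each set separately; drop_first removes the first category
train_enc = pd.get_dummies(train, drop_first=True, dtype=int)
test_enc = pd.get_dummies(test, drop_first=True, dtype=int)

# Force the test frame to have exactly the train columns, in the same order.
# Columns missing from test are created and filled with 0;
# columns that exist only in test are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_aligned)
```

Note that the "Electric" row ends up with zeros in all fuel_type columns, so it looks the same as the dropped baseline category. That is a limitation of this approach for categories not seen during training.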

In [None]:
df_train_encoded.describe()

In [None]:
df_train_encoded.head()

In [None]:
df_train_encoded.drop(['price'], axis=1).plot(kind='box', figsize=(20,10))

# Model Training

After analysing the data, filling the columns with missing values and encoding the categorical features, I moved on to training the models.

First, baseline models without tuning or normalization

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split

I created 2 functions:

a. getXy: splits the data into features and target
b. run_base_models: fits the base models, plots predicted vs actual values and returns the RMSE of each of them

In [None]:
def getXy(df):
    X = df.drop(['price'], axis=1)
    y = df['price']

    return X,y

def run_base_models(df):
    X,y = getXy(df)
    # split the Dataframe
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=42, shuffle=False) 

    #  basemodels to use
    base_models = {
    "Linear regression": LinearRegression(),
    "Decision tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(max_depth=10),
    "k-NN": KNeighborsRegressor(n_neighbors=3)
    }
    
    fig,ax = plt.subplots(len(base_models),1, figsize=(20,10))
    base_model_scores ={}

    for idx, (name, model) in enumerate(base_models.items()):

        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        pred = model.predict(X_test)

        # i used numpy to compute each model rmse value
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        
        print(f'{name}: score = {score}')

         # plot pred vs actual
        ax[idx].plot(y_test.values, pred, c='g', marker='o', linestyle='None')
        ax[idx].plot(y_test.values, y_test.values, c='r')
        ax[idx].set_ylabel('Predicted')
        ax[idx].set_xlabel('Actual')
        ax[idx].set_title(f'{name} / Score =  {score}/ RMSE = {rmse}')   
        
        # store the RMSE of each model (the R^2 score is printed above)
        base_model_scores[name] = rmse

    return base_model_scores
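
The two numbers reported for each baseline measure different things: score() of a scikit-learn regressor returns the coefficient of determination (R^2), while the RMSE is in the same units as the price. A small hand-checkable sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, small enough to verify by hand
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# RMSE: square root of the mean squared error
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2: 1 - (residual sum of squares / total sum of squares)
# here: 1 - 5 / 8 = 0.375
r2 = r2_score(y_true, y_pred)

print(f"RMSE (manual):  {rmse_manual:.4f}")
print(f"RMSE (sklearn): {rmse_sklearn:.4f}")
print(f"R^2:            {r2:.3f}")
```

A lower RMSE is better, while a higher R^2 is better, so the two should not be stored under the same key.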

Runs the baseline models and plots predicted vs actual.

In [None]:
models_scores = pd.DataFrame()
models_scores['Base Models'] = run_base_models(df_train_encoded)

Feature Engineering

Scales the features with StandardScaler and applies the same fitted scaler to the test features.

In [None]:
from sklearn.preprocessing import StandardScaler

X, y = getXy(df_train_encoded)

standard_Scaler = StandardScaler().fit(X)

df_train_scaled = pd.DataFrame(standard_Scaler.transform(X), columns=X.columns)

# select the same feature columns, in the same order, that the scaler was fitted on
df_test_scaled = standard_Scaler.transform(df_test_encoded[X.columns])



In [None]:
df_train_scaled.describe()

In [None]:
df_train_scaled.head()

In [None]:
# df_train_scaled holds only the scaled features, so the target comes from y (returned by getXy above)
X_scaled = df_train_scaled
y_scaled = y

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=.20, random_state=42, shuffle=True)
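
One point to be aware of: when the scaler is fitted on all rows before the split, the held-out rows influence the scaling statistics. A possible alternative, sketched here on synthetic data, is to put the scaler and the model in a scikit-learn pipeline so that the scaler is refitted on the training part of each fold. The names `X_demo`, `y_demo` and `pipeline` are assumptions of this example, and this is not the approach used in the rest of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (stands in for the encoded car features)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# The scaler is part of the estimator, so it is fitted only on the
# training rows of each cross-validation fold
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))

cv = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(
    pipeline, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)

# scikit-learn returns the negated RMSE, so flip the sign
rmse_per_fold = -scores
print(rmse_per_fold)
```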

Models Tuning

Runs a grid search over the hyperparameters of each model, scored by cross-validated RMSE.

In [None]:

from sklearn.model_selection import GridSearchCV, KFold

results = []
best_models = {}

models = {
    "linreg": (LinearRegression(), {
        "fit_intercept": [True, False],
        "positive": [False, True]
    }),

    "knn": (KNeighborsRegressor(), {
        "n_neighbors": [3, 5, 7],
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "minkowski"],
        "p": [1, 2]
    }),

    "dt": (DecisionTreeRegressor(), {
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
        "max_features": ["sqrt", "log2"],
        "ccp_alpha": [0.0, 0.001, 0.01]
    }),

    "rf": (RandomForestRegressor(), {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
        "min_samples_split": [5, 10],
        "min_samples_leaf": [1, 2],
        "max_features": ["sqrt", "log2", None],
        "bootstrap": [True, False]
    }),
}

cv = KFold(n_splits=3, shuffle=True, random_state=42)

for name, (model, grid) in models.items():
    gs = GridSearchCV(
        estimator=model, 
        param_grid=grid, cv=cv,
        scoring="neg_root_mean_squared_error",
        n_jobs=-1
    )
    gs.fit(X_train, y_train)
    results.append({"model": name, "rmse": -gs.best_score_, "params": gs.best_params_})
    best_models[name] = gs.best_estimator_

results_df = pd.DataFrame(results).sort_values("rmse")
results_df
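
The sign flip in `-gs.best_score_` is needed because scikit-learn scorers follow a "higher is better" convention, so the RMSE is reported as a negative number. A minimal sketch on synthetic data with a deliberately small grid (the names `X_demo`, `y_demo` and `search` are assumptions of this example):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, only for illustration
X_demo, y_demo = make_regression(
    n_samples=150, n_features=4, noise=5.0, random_state=0
)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, None]},
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated RMSE over the folds for the best parameters
best_rmse = -search.best_score_

print("best params:", search.best_params_)
print(f"cross-validated RMSE: {best_rmse:.2f}")
```

With the default refit=True, `best_estimator_` is refitted on all the data passed to fit(), so it can be used directly for prediction.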

Builds and displays the sorted grid-search results table.

In [None]:
results_df = pd.DataFrame(results).sort_values("rmse")
results_df


Kaggle Submission File Generation

Generates submission predictions from the best model and saves submission.csv.

In [None]:
# Get the name of the best model
best_model_name = results_df.iloc[0]['model']

# Retrieve the best estimator for that model
best = best_models[best_model_name]

# The models were trained on encoded and scaled features, so the test set must go through the same preprocessing.
# df_test_scaled was produced above with the train columns and the scaler fitted on the training data.
test_features = pd.DataFrame(df_test_scaled, columns=X.columns)

# Make predictions on the test set
predictions = best.predict(test_features)

# Create the submission DataFrame
submission_df = pd.DataFrame({'id': df_test['id'], 'price': predictions})

# Ensure prices are non-negative, as car prices cannot be negative
submission_df['price'] = submission_df['price'].clip(lower=0)

# Display the first few rows of the submission file
display(submission_df.head())

# Save to CSV for Kaggle submission
submission_df.to_csv('submission.csv', index=False)