# Prediction of car prices based on their condition


![image](archive/images/dataset-cover.jpg)

The data comes from Kaggle. Click [here](https://www.kaggle.com/datasets/sidharth178/car-prices-dataset) to see the dataset.

## Table of Contents
1. [First view at the data](#section1)
2. [](#section2)

## First view at the data <a id="section1"></a>

In [1]:
import numpy as np
import pandas as pd

# We already have the data partitioned
df_train = pd.read_csv("archive/train.csv")
print(f"Shape: {df_train.shape}")
df_train.head()

Shape: (19237, 18)


| | ID | Price | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Gear box type | Drive wheels | Doors | Wheel | Color | Airbags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45654403 | 13328 | 1399 | LEXUS | RX 450 | 2010 | Jeep | Yes | Hybrid | 3.5 | 186005 km | 6.0 | Automatic | 4x4 | 04-May | Left wheel | Silver | 12 |
| 1 | 44731507 | 16621 | 1018 | CHEVROLET | Equinox | 2011 | Jeep | No | Petrol | 3.0 | 192000 km | 6.0 | Tiptronic | 4x4 | 04-May | Left wheel | Black | 8 |
| 2 | 45774419 | 8467 | - | HONDA | FIT | 2006 | Hatchback | No | Petrol | 1.3 | 200000 km | 4.0 | Variator | Front | 04-May | Right-hand drive | Black | 2 |
| 3 | 45769185 | 3607 | 862 | FORD | Escape | 2011 | Jeep | Yes | Hybrid | 2.5 | 168966 km | 4.0 | Automatic | 4x4 | 04-May | Left wheel | White | 0 |
| 4 | 45809263 | 11726 | 446 | HONDA | FIT | 2014 | Hatchback | Yes | Petrol | 1.3 | 91901 km | 4.0 | Automatic | Front | 04-May | Left wheel | Silver | 4 |


In [2]:
import mitosheet
mitosheet.sheet(df_train)

Exception: The mitosheet currently only works in JupyterLab.

To see instructions on getting Mitosheet running in JupyterLab, find install instructions here: https://docs.trymito.io/getting-started/installing-mito

In [None]:
df_train.isna().sum() # No NaN values are reported, which is uncommon; missing values can still be encoded as placeholders (e.g. '-' in Levy)

ID                  0
Price               0
Levy                0
Manufacturer        0
Model               0
Prod. year          0
Category            0
Leather interior    0
Fuel type           0
Engine volume       0
Mileage             0
Cylinders           0
Gear box type       0
Drive wheels        0
Doors               0
Wheel               0
Color               0
Airbags             0
dtype: int64

In [None]:
levy_missing_share = (df_train["Levy"] == "-").mean()  # missing levies are encoded as '-'
print(f"{levy_missing_share:.2%} of the Levy values are missing")
print("That share is very high, so we won't use this feature")
X_train = df_train[df_train.columns.drop(["ID", "Price", "Levy"])]
Y_train = df_train["Price"]

30.25% of the Levy values are missing
That share is very high, so we won't use this feature
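
The share above comes from counting the `-` placeholder, which `isna()` does not detect. A minimal sketch of that check on a few Levy-like values; the helper name `placeholder_share` and the sample list are assumptions made for illustration and are not part of the notebook:

```python
def placeholder_share(values, placeholder="-"):
    """Return the fraction of entries equal to the placeholder string."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if v == placeholder) / len(values)


# Hypothetical sample in the same format as the Levy column
sample_levy = ["1399", "1018", "-", "862"]
print(f"{placeholder_share(sample_levy):.2%}")  # 1 of the 4 entries is "-"
```

The same idea can be repeated for other columns to look for placeholders that `isna()` would miss.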


### We first construct the Pipeline to prepare the data

In [None]:
# Mileage first needs to be cleaned (e.g. "186005 km") and is later treated as a numeric attribute
transf_type = {"numeric": ["Prod. year", "Engine volume", "Cylinders", "Airbags"],
               "cat_onehot": ["Category", "Fuel type", "Drive wheels"],
               "custom": ["Manufacturer", "Model", "Leather interior", "Gear box type", "Doors", "Wheel", "Color", "Mileage"]}
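
The Mileage cleaning mentioned above could look roughly like this. It is only a sketch: the helper name `parse_mileage` is hypothetical, and it assumes every value follows the `<integer> km` format seen in the first five rows.

```python
def parse_mileage(raw):
    """Convert a string such as '186005 km' to an integer number of kilometres."""
    text = str(raw).strip()
    # Assumption: the unit, when present, is always a trailing "km"
    if text.lower().endswith("km"):
        text = text[:-2]
    return int(text.strip())


print(parse_mileage("186005 km"))
```

On the DataFrame it could be applied with `X_train["Mileage"].apply(parse_mileage)`, after which the column can be handled like the other numeric attributes.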

In [None]:
from sklearn.preprocessing import OneHotEncoder
        
onehot_encoder = OneHotEncoder()
X_train_encoded = onehot_encoder.fit_transform(X_train[transf_type["cat_onehot"]])


print(X_train_encoded.toarray())
for i, j in zip(transf_type["cat_onehot"], onehot_encoder.categories_):
    print(f"{i}: {j}")

[[0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 ...
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 1. 0.]]
Category: ['Cabriolet' 'Coupe' 'Goods wagon' 'Hatchback' 'Jeep' 'Limousine'
 'Microbus' 'Minivan' 'Pickup' 'Sedan' 'Universal']
Fuel type: ['CNG' 'Diesel' 'Hybrid' 'Hydrogen' 'LPG' 'Petrol' 'Plug-in Hybrid']
Drive wheels: ['4x4' 'Front' 'Rear']
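
With 11, 7 and 3 categories respectively, the encoded matrix has 11 + 7 + 3 = 21 columns. To make the mapping concrete, here is a small pure-Python sketch of how one value becomes one row of indicator columns. The function `one_hot` is a hypothetical helper for illustration, not the scikit-learn implementation:

```python
def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]


# Category order taken from the "Drive wheels" line of the output above
drive_wheels = ["4x4", "Front", "Rear"]
print(one_hot("Front", drive_wheels))  # [0.0, 1.0, 0.0]
```

Note that this sketch returns all zeros for a value it has never seen, whereas `OneHotEncoder` raises an error at transform time by default unless `handle_unknown="ignore"` is set.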


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
    ])

num_pipeline.fit_transform(X_train[transf_type["numeric"]])

ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: '2.0 Turbo'
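
The error shows that `Engine volume` is not purely numeric: some values carry a `Turbo` suffix. One possible way to handle it, sketched below, is to split each value into a numeric volume and a turbo flag before it reaches the numeric pipeline. The helper name `parse_engine_volume` is hypothetical, and the sketch assumes `Turbo` is the only suffix that appears, which the error message alone does not guarantee.

```python
def parse_engine_volume(raw):
    """Split a value such as '2.0 Turbo' into (volume, is_turbo)."""
    parts = str(raw).split()
    volume = float(parts[0])  # the first token is assumed to be the volume
    is_turbo = any(part.lower() == "turbo" for part in parts[1:])
    return volume, is_turbo


print(parse_engine_volume("2.0 Turbo"))  # (2.0, True)
print(parse_engine_volume("3.5"))        # (3.5, False)
```

The volume can then be imputed and scaled with the other numeric attributes, while the flag can be kept as a separate 0/1 feature.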

In [None]:
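from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class CatTransformer(BaseEstimator, TransformerMixin):
    def __init__(self): # no *args or **kwargs
        pass
    def fit(self, X, y=None):
        # Learn the categories here, so that transform() never refits on new data
        self.onehot_encoder_ = OneHotEncoder().fit(X)
        return self
    def transform(self, X):
        # "Yes"/"No" -> 1/0; numeric attributes are much easier to manage.
        # Compare with "Yes" explicitly: any non-empty string is truthy, so `1 if x else 0` would always give 1.
        has_leather_interior = (X["Leather interior"] == "Yes").astype(int) # not used yet

        # "Levy" is already excluded from X_train, so the global df_train is not modified here
        # TODO: change this; no class is needed for one-hot encoding. Use the class for the "custom" columns instead
        return self.onehot_encoder_.transform(X)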

In [None]:
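Y_train = df_train["Price"]
# "Levy" is excluded as well, as decided above; errors="ignore" keeps this cell re-runnable if the column is already gone
X_train = df_train.drop(columns=["ID", "Price", "Levy"], errors="ignore")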

## License

This Jupyter Notebook and its contents are licensed under the terms of the GNU General Public License Version 2 as published by the Free Software Foundation. The full text of the license can be found at: https://www.gnu.org/licenses/gpl-2.0.html

Copyright (c) 2023, Joaquín Mateos Barroso

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/ for a list of additional licenses.