## Used Car Price Prediction
### 1. Problem statement
- This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.
- The goal is to predict the price of a car from these input features.
- The predictions can be used to suggest a price to new sellers based on market conditions.
  
    
### 2. Data Collection
- The dataset was collected by scraping the CarDekho website.
- The data consists of 13 columns and 15411 rows.

In [1]:
##importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings 
warnings.filterwarnings("ignore")

In [2]:
## loading the dataset
df = pd.read_csv("cardekho_dataset.csv", index_col=[0])

In [3]:
df.head()

| | car_name | brand | model | vehicle_age | km_driven | seller_type | fuel_type | transmission_type | mileage | engine | max_power | seats | selling_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Alto | Maruti | Alto | 9 | 120000 | Individual | Petrol | Manual | 19.7 | 796 | 46.3 | 5 | 120000 |
| 1 | Hyundai Grand | Hyundai | Grand | 5 | 20000 | Individual | Petrol | Manual | 18.9 | 1197 | 82.0 | 5 | 550000 |
| 2 | Hyundai i20 | Hyundai | i20 | 11 | 60000 | Individual | Petrol | Manual | 17.0 | 1197 | 80.0 | 5 | 215000 |
| 3 | Maruti Alto | Maruti | Alto | 9 | 37000 | Individual | Petrol | Manual | 20.92 | 998 | 67.1 | 5 | 226000 |
| 4 | Ford Ecosport | Ford | Ecosport | 6 | 30000 | Dealer | Diesel | Manual | 22.77 | 1498 | 98.59 | 5 | 570000 |


### Data Cleaning

#### Handling missing values

In [4]:
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [5]:
df.columns

Index(['car_name', 'brand', 'model', 'vehicle_age', 'km_driven', 'seller_type',
       'fuel_type', 'transmission_type', 'mileage', 'engine', 'max_power',
       'seats', 'selling_price'],
      dtype='object')

The `df.isnull().sum()` output shows zero nulls in every column, so there are no missing values in the dataset.

In [6]:
## dropping unnecessary columns (car_name is just brand + model, and model is kept)
df.drop(["car_name", "brand"], axis=1, inplace=True)

In [7]:
df.head()

| | model | vehicle_age | km_driven | seller_type | fuel_type | transmission_type | mileage | engine | max_power | seats | selling_price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alto | 9 | 120000 | Individual | Petrol | Manual | 19.7 | 796 | 46.3 | 5 | 120000 |
| 1 | Grand | 5 | 20000 | Individual | Petrol | Manual | 18.9 | 1197 | 82.0 | 5 | 550000 |
| 2 | i20 | 11 | 60000 | Individual | Petrol | Manual | 17.0 | 1197 | 80.0 | 5 | 215000 |
| 3 | Alto | 9 | 37000 | Individual | Petrol | Manual | 20.92 | 998 | 67.1 | 5 | 226000 |
| 4 | Ecosport | 6 | 30000 | Dealer | Diesel | Manual | 22.77 | 1498 | 98.59 | 5 | 570000 |


In [8]:
## getting unique models
df['model'].unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [9]:
## Independent and dependent features
X = df.drop('selling_price', axis = 1)
y = df['selling_price']

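The high-cardinality `model` column has to be encoded as numbers. Frequency encoding was considered (it appears in a disabled cell further down), but the notebook ends up using label encoding for this column.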
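For reference, frequency encoding replaces each category with the number of rows in which it appears. The following is a minimal sketch on a toy frame, not the CarDekho data; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the notebook's `model` column (values are illustrative).
toy = pd.DataFrame({"model": ["Alto", "i20", "Alto", "Swift", "Alto", "i20"]})

# Count how often each category appears, then map those counts back onto the rows.
freq = toy["model"].value_counts()
toy["model_encoded"] = toy["model"].map(freq)

print(toy)
```

One property worth keeping in mind: two different models with the same count receive the same code, so the encoding is not reversible.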

In [10]:
X

| | model | vehicle_age | km_driven | seller_type | fuel_type | transmission_type | mileage | engine | max_power | seats |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alto | 9 | 120000 | Individual | Petrol | Manual | 19.70 | 796 | 46.30 | 5 |
| 1 | Grand | 5 | 20000 | Individual | Petrol | Manual | 18.90 | 1197 | 82.00 | 5 |
| 2 | i20 | 11 | 60000 | Individual | Petrol | Manual | 17.00 | 1197 | 80.00 | 5 |
| 3 | Alto | 9 | 37000 | Individual | Petrol | Manual | 20.92 | 998 | 67.10 | 5 |
| 4 | Ecosport | 6 | 30000 | Dealer | Diesel | Manual | 22.77 | 1498 | 98.59 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19537 | i10 | 9 | 10723 | Dealer | Petrol | Manual | 19.81 | 1086 | 68.05 | 5 |
| 19540 | Ertiga | 2 | 18000 | Dealer | Petrol | Manual | 17.50 | 1373 | 91.10 | 7 |
| 19541 | Rapid | 6 | 67000 | Dealer | Diesel | Manual | 21.14 | 1498 | 103.52 | 5 |
| 19542 | XUV500 | 5 | 3800000 | Dealer | Diesel | Manual | 16.00 | 2179 | 140.00 | 7 |


In [11]:
# Split the columns by type; numeric columns with at most 25 distinct values count as discrete
num_features = [feature for feature in df.columns if df[feature].dtype != "O"]
print("Total Numerical_features:", len(num_features))

cat_features = [feature for feature in df.columns if df[feature].dtype == "O"]
print("Total Categorical_features:", len(cat_features))

discrete_features = [feature for feature in num_features if df[feature].nunique() <= 25]
print("Total Discrete_features:", len(discrete_features))

continuous_features = [feature for feature in num_features if feature not in discrete_features]
print("Total Continuous_features:", len(continuous_features))

Total Numerical_features: 7
Total Categorical_features: 4
Total Discrete_features: 2
Total Continuous_features: 5


In [12]:
# Frequency encoding (left disabled; label encoding is used in the next cell)
# freq = df['model'].value_counts()
# df['model_encoded'] = df['model'].map(freq)

In [13]:
from sklearn.preprocessing import LabelEncoder

# Label-encode the `model` column: each distinct model name becomes an integer code
le = LabelEncoder()
X['model'] = le.fit_transform(X['model'])

In [14]:
## create a column transformer with 2 types of transformers (one-hot encoding and standard scaling)
num_features = X.select_dtypes(exclude="object").columns
one_hot_columns = ["seller_type", "fuel_type", "transmission_type"]


from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ("OHE", oh_transformer, one_hot_columns),
        ("Std Scaler", numeric_transformer, num_features)
    ], remainder="passthrough"
)

In [15]:
X = preprocessor.fit_transform(X)

In [16]:
## separating the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
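Note that the cells above fit the preprocessor on all rows before splitting, so the scaler's mean and variance also reflect the test rows. One common alternative is to split first and wrap the preprocessing and the model in a `Pipeline`. The sketch below shows this on a small synthetic frame rather than the CarDekho data. The names `demo_X`, `demo_y`, `demo_preprocessor`, and `pipe` and all values are assumptions made for illustration; only the column names `fuel_type` and `vehicle_age` mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (made up): price falls with age, plus some noise.
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "fuel_type": rng.choice(["Petrol", "Diesel", "CNG"], size=n),
    "vehicle_age": rng.integers(1, 15, size=n),
})
demo_y = 800_000 - 40_000 * demo_X["vehicle_age"] + rng.normal(0, 10_000, size=n)

# Split first, so the test rows never influence the fitted scaler or encoder.
Xtr, Xte, ytr, yte = train_test_split(demo_X, demo_y, test_size=0.20, random_state=42)

demo_preprocessor = ColumnTransformer([
    ("OHE", OneHotEncoder(drop="first"), ["fuel_type"]),
    ("Std Scaler", StandardScaler(), ["vehicle_age"]),
])

# The pipeline fits the preprocessing on the training rows only,
# then applies the same fitted transformation to the test rows.
pipe = Pipeline([("prep", demo_preprocessor), ("reg", LinearRegression())])
pipe.fit(Xtr, ytr)

test_r2 = pipe.score(Xte, yte)
print(f"Test R2: {test_r2:.3f}")
```

With this layout the same `pipe` object can also be passed to a cross-validated search, so each fold refits the preprocessing on its own training portion.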

## Model Training and Model Selection

In [24]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor

# root_mean_squared_error is not imported: it is unused below and only exists in newer scikit-learn releases
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [25]:
## Function to evaluate a model's predictions
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it twice
    r2_square = r2_score(true, predicted)
    return mae, mse, rmse, r2_square
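To make the four metrics concrete, here is a small worked example that computes them by hand on toy numbers and checks the results against scikit-learn. The arrays `true` and `pred` are made up for illustration and have nothing to do with the car prices.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions (illustrative values only).
true = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = true - pred                        # [-10, 10, -20, 20]
mae_manual = np.mean(np.abs(errors))        # (10 + 10 + 20 + 20) / 4 = 15
mse_manual = np.mean(errors ** 2)           # (100 + 100 + 400 + 400) / 4 = 250
rmse_manual = np.sqrt(mse_manual)           # sqrt(250), about 15.81
ss_res = np.sum(errors ** 2)                # 1000
ss_tot = np.sum((true - true.mean()) ** 2)  # 50000
r2_manual = 1 - ss_res / ss_tot             # 1 - 1000 / 50000 = 0.98

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mae_manual, mean_absolute_error(true, pred))
assert np.isclose(mse_manual, mean_squared_error(true, pred))
assert np.isclose(r2_manual, r2_score(true, pred))

print(mae_manual, mse_manual, round(rmse_manual, 2), r2_manual)
```

MAE and RMSE are in the same unit as the target (rupees in this notebook), while R2 is unitless, which is why the RMSE values reported below are so large in absolute terms.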

In [26]:
## Model Training
models = {
    "Linear Regression": LinearRegression(),
    "Adaboost Regression": AdaBoostRegressor(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
   
}

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_mse, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_mse, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print("Model Name:", name)

    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')

    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))

    print('='*35)

Model Name: Linear Regression
Model performance for Training set
- Root Mean Squared Error: 553855.6665
- Mean Absolute Error: 268101.6071
- R2 Score: 0.6218
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 502543.5930
- Mean Absolute Error: 279618.5794
- R2 Score: 0.6645
Model Name: Adaboost Regression
Model performance for Training set
- Root Mean Squared Error: 461695.0089
- Mean Absolute Error: 339458.5111
- R2 Score: 0.7372
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 493489.9434
- Mean Absolute Error: 355864.7575
- R2 Score: 0.6765
Model Name: Lasso
Model performance for Training set
- Root Mean Squared Error: 553855.6710
- Mean Absolute Error: 268099.2226
- R2 Score: 0.6218
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 502542.6696
- Mean Absolute Error: 279614.7461
- R2 Score: 0.6645
Model Name: Ridge
Model performance for Training set
- Root

We can see that K-Nearest Neighbors and Decision Tree are performing well. The next cells run hyperparameter tuning to find the best parameters; the models actually tuned there are K-Nearest Neighbors, Random Forest, and AdaBoost.

In [27]:
## Defining the parameter grids for hyperparameter tuning

knn_params = {"n_neighbors": [2, 3, 10, 20, 40, 50]}
rf_params = {"max_depth": [5, 8, 15, None, 10],
             # "auto" was removed for RandomForestRegressor in scikit-learn 1.3;
             # 1.0 (all features) is its equivalent for regressors
             "max_features": [5, 7, 1.0, 8],
             "min_samples_split": [2, 8, 15, 20],
             "n_estimators": [100, 200, 500, 1000]}

ada_params = {
    "n_estimators": [50, 60, 70, 80],
    "loss": ['linear', 'square', 'exponential']
}

## models for hyperparameter tuning
randomcv_models = [("KNN", KNeighborsRegressor(), knn_params),
                   ("RF", RandomForestRegressor(), rf_params),
                   ("Adaboost", AdaBoostRegressor(), ada_params)]

In [28]:
## Hyperparameter tuning with randomized search
from sklearn.model_selection import RandomizedSearchCV

model_param = {}
for name, model, params in randomcv_models:
    # named `random_search` to avoid shadowing the standard-library `random` module
    random_search = RandomizedSearchCV(estimator=model,
                                       param_distributions=params,
                                       n_iter=100,
                                       cv=3,
                                       verbose=2,
                                       n_jobs=-1)
    random_search.fit(X_train, y_train)
    model_param[name] = random_search.best_params_

Fitting 3 folds for each of 6 candidates, totalling 18 fits
Fitting 3 folds for each of 100 candidates, totalling 300 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits


In [29]:
for model_name in model_param:
    print(f"--- Best Params for {model_name} ---")
    print(model_param[model_name])

--- Best Params for KNN ---
{'n_neighbors': 10}
--- Best Params for RF ---
{'n_estimators': 100, 'min_samples_split': 2, 'max_features': 8, 'max_depth': 15}
--- Best Params for Adaboost ---
{'n_estimators': 50, 'loss': 'linear'}
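Copying the best parameters by hand into a new cell is easy to get wrong. As an alternative, a fitted `RandomizedSearchCV` also exposes `best_estimator_`, which is the model refit on the full training data with the winning parameters (this relies on the default `refit=True`). The sketch below shows this on synthetic data from `make_regression`; `demo_X`, `demo_y`, `search`, and `tuned_knn` are names assumed for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for X_train / y_train.
demo_X, demo_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=KNeighborsRegressor(),
    param_distributions={"n_neighbors": [2, 3, 10, 20]},
    n_iter=4,          # equal to the grid size, so every candidate is tried
    cv=3,
    random_state=0,    # makes the sampled candidates reproducible
)
search.fit(demo_X, demo_y)

# With refit=True (the default), the best model is already trained and ready to use.
tuned_knn = search.best_estimator_
demo_pred = tuned_knn.predict(demo_X[:5])

print(search.best_params_)
print(tuned_knn.n_neighbors)
```

Storing `best_estimator_` per model in the tuning loop would keep the retrained models and the reported best parameters in sync.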


In [31]:
## Retraining the selected models.
## KNN and AdaBoost use the best parameters found above. The Random Forest settings
## here differ from the search result (n_estimators=100, max_features=8, max_depth=15).
models = {
    "Random Forest Regressor": RandomForestRegressor(n_estimators=500, min_samples_split=2, max_features='sqrt', max_depth=None,
                                                     n_jobs=-1),
    "K-Neighbors Regressor": KNeighborsRegressor(n_neighbors=10, n_jobs=-1),
    "Adaboost": AdaBoostRegressor(n_estimators=50, loss='linear')
}
for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_mse, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_mse, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print("Model Name:", name)

    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')

    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))

    print('='*35)

Model Name: Random Forest Regressor
Model performance for Training set
- Root Mean Squared Error: 133274.4862
- Mean Absolute Error: 39438.7220
- R2 Score: 0.9781
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 205549.4924
- Mean Absolute Error: 98426.5477
- R2 Score: 0.9439
Model Name: K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 363460.7706
- Mean Absolute Error: 103472.0474
- R2 Score: 0.8371
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 263888.0623
- Mean Absolute Error: 117496.2131
- R2 Score: 0.9075
Model Name: Adaboost
Model performance for Training set
- Root Mean Squared Error: 430947.6667
- Mean Absolute Error: 312958.6420
- R2 Score: 0.7710
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 459459.0250
- Mean Absolute Error: 327439.2371
- R2 Score: 0.7196
