# CarDekho Used Car Price Prediction Project - 1
### The CarDekho project has four volumes (Vol. 1, Vol. 2, Vol. 3, Vol. 4), each with a different dataset. The datasets become more complex and difficult with each volume.
### This notebook covers Vol. 1, which contains the simplest dataset. The aim is to train a machine learning model that predicts the selling price of a car from the input features.
### The dataset consists of the following columns:
1. Car_Name : Name of the car.
2. Year : Year of registration.
3. Present_Price : Current market value of the car.
4. Kms_Driven : Number of kilometres the car has been driven.
5. Fuel_Type : Fuel type of the car.
6. Seller_Type : Whether the seller is a dealership or an individual.
7. Transmission : Transmission type of the car.
8. Owner : Whether the car is first-owner, second-owner, etc.
9. Selling_Price (Target Column) : The selling price of the car, to be predicted from the columns mentioned above.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [105]:
# Forward slash avoids the invalid "\c" escape sequence and works on every OS.
df = pd.read_csv("Datasets/car_data_v1.csv")
df.head()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [106]:
df.shape

(301, 9)

In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Car_Name       301 non-null    object 
 1   Year           301 non-null    int64  
 2   Selling_Price  301 non-null    float64
 3   Present_Price  301 non-null    float64
 4   Kms_Driven     301 non-null    int64  
 5   Fuel_Type      301 non-null    object 
 6   Seller_Type    301 non-null    object 
 7   Transmission   301 non-null    object 
 8   Owner          301 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 21.3+ KB


In [108]:
df.describe()

Unnamed: 0,Year,Selling_Price,Present_Price,Kms_Driven,Owner
count,301.0,301.0,301.0,301.0,301.0
mean,2013.627907,4.661296,7.628472,36947.20598,0.043189
std,2.891554,5.082812,8.644115,38886.883882,0.247915
min,2003.0,0.1,0.32,500.0,0.0
25%,2012.0,0.9,1.2,15000.0,0.0
50%,2014.0,3.6,6.4,32000.0,0.0
75%,2016.0,6.0,9.9,48767.0,0.0
max,2018.0,35.0,92.6,500000.0,3.0


## Data Preprocessing

In [109]:
## checking missing values
df.isna().sum()  # df.isnull().sum() is equivalent

Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64

In [110]:
## Checking duplicates
duplicates = df[df.duplicated()]

In [111]:
# Compare all columns (the default) and keep the first occurrence of each row.
df.drop_duplicates(inplace=True)

In [112]:
df.shape

(299, 9)
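
To illustrate what the duplicate check does, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the dataset). `duplicated()` flags each repeat of an earlier row, and `drop_duplicates()` with no arguments compares all columns.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly.
toy = pd.DataFrame(
    {
        "Year": [2014, 2013, 2014],
        "Kms_Driven": [27000, 43000, 27000],
        "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    }
)

# duplicated() marks a row True when an identical row appeared earlier.
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)   # number of repeated rows
print(deduped.shape)  # shape after removing them
```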

In [113]:
df.drop("Car_Name", axis = 1, inplace=True)

In [114]:
## Handling the categorical columns

In [115]:
cat_cols = df.select_dtypes(include = "object").columns
cat_cols

Index(['Fuel_Type', 'Seller_Type', 'Transmission'], dtype='object')

In [116]:
# Numeric features are everything except the categorical columns and the target.
numeric_cols = df.drop(columns=[*cat_cols, "Selling_Price"]).columns
numeric_cols

Index(['Year', 'Present_Price', 'Kms_Driven', 'Owner'], dtype='object')

In [117]:
for col in cat_cols:
    print(df[col].value_counts())

Fuel_Type
Petrol    239
Diesel     58
CNG         2
Name: count, dtype: int64
Seller_Type
Dealer        193
Individual    106
Name: count, dtype: int64
Transmission
Manual       260
Automatic     39
Name: count, dtype: int64


In [118]:
## Applying OneHot encoding to these columns.

## Preprocessing with sklearn Pipelines and ColumnTransformer
#### Steps:
1. Create a numeric pipeline containing a standard scaler.
2. Create a categorical pipeline containing a OneHot encoder followed by a standard scaler.
3. Initialize the column transformer and pass it both pipelines as transformers, along with the columns each one applies to.
4. Call fit_transform on X_train and only transform on X_test, so the scaler and encoder statistics are learned from the training data alone and no information leaks from the test set.

In [119]:
X = df.drop("Selling_Price", axis = 1)
y = df['Selling_Price']

In [120]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [121]:
numeric_pipeline = Pipeline(
    steps = [
        ("Scaler", StandardScaler())
    ]
)
cat_pipeline = Pipeline(
    steps = [
        ("oh encoder", OneHotEncoder(sparse_output=False)),
        ("Scaler", StandardScaler())
    ]
)

In [122]:
preprocessor = ColumnTransformer(
    transformers = [
        ("num_pipeline", numeric_pipeline, numeric_cols),
        ("cat_pipeline", cat_pipeline, cat_cols)
    ],
    #remainder="passthrough"
)

X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)

## Model Training

In [123]:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score

In [124]:
models = {
    "LinearRegression" : LinearRegression(),
    "SVM" : SVR(),
    "KNN" : KNeighborsRegressor(),
    "DecisionTree" : DecisionTreeRegressor(), 
    "RandomForest" : RandomForestRegressor(),
    "Adaboost" : AdaBoostRegressor(), 
    "Gradient" : GradientBoostingRegressor(), 
    "XGboost" : XGBRegressor()
}
models

{'LinearRegression': LinearRegression(),
 'SVM': SVR(),
 'KNN': KNeighborsRegressor(),
 'DecisionTree': DecisionTreeRegressor(),
 'RandomForest': RandomForestRegressor(),
 'Adaboost': AdaBoostRegressor(),
 'Gradient': GradientBoostingRegressor(),
 'XGboost': XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              feature_weights=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parall

In [129]:
# Fit each model and report its score on the train and test sets.
# Note: the "accuracy" label is r2_score * 100 (R^2), not classification accuracy.
for name, model in models.items():
    model.fit(X_train_scaled, y_train)

    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_test_scaled)
    print(f"{name} training accuracy : {r2_score(y_train, y_pred_train) * 100}")
    print(f"{name} test accuracy : {r2_score(y_test, y_pred_test) * 100}")
    print("-" * 30)

LinearRegression training accuracy : 90.49453218243147
LinearRegression test accuracy : 75.2815421583293
------------------------------
SVM training accuracy : 66.07201826030422
SVM test accuracy : 61.229534438916346
------------------------------
KNN training accuracy : 91.34334683616224
KNN test accuracy : 83.27262379065023
------------------------------
DecisionTree training accuracy : 100.0
DecisionTree test accuracy : 83.65400301156203
------------------------------
RandomForest training accuracy : 98.0509409190436
RandomForest test accuracy : 50.20577817036569
------------------------------
Adaboost training accuracy : 97.00167338275814
Adaboost test accuracy : 62.97679152303917
------------------------------
Gradient training accuracy : 99.66774927163206
Gradient test accuracy : 72.89408107274089
------------------------------
XGboost training accuracy : 99.99945593968305
XGboost test accuracy : 79.12929582596504
------------------------------
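
The values printed above come from `r2_score`, so they are coefficients of determination (R^2) scaled by 100 rather than classification accuracies. As a reminder of what that number measures, here is a minimal sketch of the standard formula R^2 = 1 - SS_res / SS_tot in plain Python. The helper name `r2_manual` and the sample numbers are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical selling prices.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_manual(y_true, y_true)                  # predictions match exactly
baseline = r2_manual(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
worse = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])      # reversed predictions

print(perfect, baseline, worse)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that. Read this way, the large gaps between training and test scores for several models above suggest overfitting on this small dataset.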


In [None]:
## Best performing models