In [1]:
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output



# Packages

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# 1. Business and Data Understanding

## Business interest

The used car market in India changes quickly, and prices vary widely with factors such as age, mileage, fuel type, and ownership history.

The dataset can be used by a car dealership business that buys and sells used cars.

### Business Objective

 - To earn a high commission on each sale, the business is interested in selling cars at **high prices**.
 - Additionally, the business is interested in **a large number of deals**, which increases overall profit.

### Business Constraint

However, selling cars at unreasonably **high prices** damages the business's reputation, which in turn reduces **the number of deals**.

### Success Criteria

Both buyers and sellers of cars should be satisfied. Which side takes priority depends on the market context (e.g. there are a lot of sellers and very few buyers).

## Exploratory Data Analysis (EDA)

### Data Collection

In [33]:
cars = pd.read_csv("https://raw.githubusercontent.com/DIG-Placements/Datasets/main/Car%20Dataset.csv")

- The data was collected from 'Car Dekho'.
- It is a secondary data source.

The dataset includes the following details for each car:

1) Car name
2) Year
3) Selling Price
4) Kms driven
5) Fuel
6) Seller type
7) Transmission
8) Owner

### Data Analysis

In [34]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4340 entries, 0 to 4339
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           4340 non-null   object
 1   year           4340 non-null   int64 
 2   selling_price  4340 non-null   int64 
 3   km_driven      4340 non-null   int64 
 4   fuel           4340 non-null   object
 5   seller_type    4340 non-null   object
 6   transmission   4340 non-null   object
 7   owner          4340 non-null   object
dtypes: int64(3), object(5)
memory usage: 271.4+ KB


- Structured Data
- Cross-Sectional Data

The variable of interest is `selling_price`.

In [75]:
target = ['selling_price']
continuousFeatures = ['km_driven', 'year']

nominalFeatures = ['name', 'fuel', 'transmission', 'seller_type']
ordinalFeatures = ['owner']

In [36]:
cars.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner


In [37]:
cars.shape

(4340, 8)

# 2. Data Preparation

## Feature engineering 0

The `name` column, which is presumably categorical, has too many unique values:

In [39]:
cars.name.nunique()

1491

Therefore we can extract the first word of the `name` column, which is most likely the brand of the car and is likely to affect `selling_price`.

In [40]:
cars['brand'] = cars.name.str.split().str[0]
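To make the extraction step concrete, here is a minimal sketch on three names taken from the `cars.head()` preview. The variable names `sample_names` and `sample_brands` are illustrative and not part of the notebook.

```python
import pandas as pd

# Three listing names copied from the cars.head() preview
sample_names = pd.Series([
    "Maruti 800 AC",
    "Hyundai Verna 1.6 SX",
    "Honda Amaze VX i-DTEC",
])

# Split each name on whitespace and keep the first token as the brand
sample_brands = sample_names.str.split().str[0]
print(sample_brands.tolist())  # ['Maruti', 'Hyundai', 'Honda']
```

One caveat of this approach: a brand whose name consists of two words (for example "Land Rover") would be cut down to its first word, so the resulting `brand` values are worth a quick look with `value_counts()`.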

After that we can drop the `name` column: with 1491 distinct values in 4340 rows, keeping it would encourage overfitting.

In [43]:
cars.drop('name', axis=1, inplace=True)
cars.head(2)

Unnamed: 0,year,selling_price,km_driven,fuel,seller_type,transmission,owner,brand
0,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti
1,2007,135000,50000,Petrol,Individual,Manual,First Owner,Maruti


In [81]:
nominalFeatures = ['brand', 'fuel', 'transmission', 'seller_type']

## Exploratory Data Analysis (EDA)

### First Moment Business Decision

#### Overall

In [63]:
cars[continuousFeatures].agg(['mean', 'median'])

Unnamed: 0,km_driven,year
mean,66215.777419,2013.090783
median,60000.0,2014.0


- `km_driven` is right skewed
- `year` is slightly left skewed
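The skew direction above is read off from the gap between mean and median. A small sketch of that heuristic, using the values reported in the table (the names `summary`, `gap`, and `direction` are illustrative; the heuristic is a rule of thumb, not a formal test):

```python
import pandas as pd

# Mean and median as reported in the output table above
summary = pd.DataFrame(
    {"km_driven": [66215.777419, 60000.0], "year": [2013.090783, 2014.0]},
    index=["mean", "median"],
)

# mean > median hints at a right skew, mean < median at a left skew
gap = summary.loc["mean"] - summary.loc["median"]
direction = gap.apply(
    lambda g: "right" if g > 0 else "left" if g < 0 else "symmetric"
)
print(direction.to_dict())  # {'km_driven': 'right', 'year': 'left'}
```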

In [60]:
ownerDict = {
    'First Owner': 1, 
    'Second Owner': 2, 
    'Fourth & Above Owner': 4,
    'Third Owner': 3, 
    'Test Drive Car': 0
}
cars.owner.map(lambda x: ownerDict[x]).agg(['mean', 'median'])

mean      1.447005
median    1.000000
Name: owner, dtype: float64
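The ordinal encoding above can also be written by passing the dictionary directly to `Series.map`. The sketch below uses a hypothetical four-row sample; `owner_rank`, `sample_owner`, and `encoded` are illustrative names, and the mapping mirrors `ownerDict`.

```python
import pandas as pd

# Same mapping as ownerDict: a lower number means fewer previous owners
owner_rank = {
    "Test Drive Car": 0,
    "First Owner": 1,
    "Second Owner": 2,
    "Third Owner": 3,
    "Fourth & Above Owner": 4,
}

# Hypothetical sample of the owner column
sample_owner = pd.Series(
    ["First Owner", "Third Owner", "Test Drive Car", "Second Owner"]
)
encoded = sample_owner.map(owner_rank)
print(encoded.tolist())  # [1, 3, 0, 2]

# A label missing from the dictionary becomes NaN instead of raising KeyError
unknown = pd.Series(["Fifth Owner"]).map(owner_rank)
print(unknown.isna().all())  # True
```

The difference from the lambda version is how unexpected labels behave: the lambda raises a `KeyError`, while the dictionary form yields `NaN`, so an `isna()` check afterwards is a sensible safeguard.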

In [94]:
cars.selling_price.agg(['mean', 'median'])

mean      504127.311751
median    350000.000000
Name: selling_price, dtype: float64

The target variable is right skewed.

In [82]:
cars[ordinalFeatures+nominalFeatures].mode()

Unnamed: 0,owner,brand,fuel,transmission,seller_type
0,First Owner,Maruti,Diesel,Manual,Individual


In [69]:
cars.owner.value_counts()

owner
First Owner             2832
Second Owner            1106
Third Owner              304
Fourth & Above Owner      81
Test Drive Car            17
Name: count, dtype: int64

In [72]:
cars.brand.value_counts()[:2]

brand
Maruti     1280
Hyundai     821
Name: count, dtype: int64

In [73]:
cars.fuel.value_counts()

fuel
Diesel      2153
Petrol      2123
CNG           40
LPG           23
Electric       1
Name: count, dtype: int64

In [74]:
cars.transmission.value_counts()

transmission
Manual       3892
Automatic     448
Name: count, dtype: int64

Each categorical variable has a single mode, although for `fuel` Diesel (2153) and Petrol (2123) are almost tied.

**Overall Conclusions:**

- `km_driven` is right skewed
- `year` is slightly left skewed
- the target (`selling_price`) variable is right skewed
- each categorical variable has a single mode (for `fuel`, Diesel and Petrol are almost tied)
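How dominant a mode is can be checked by turning counts into shares. A short sketch using the `fuel` counts reported above (the names `fuel_counts`, `shares`, and `top_two` are illustrative):

```python
import pandas as pd

# Counts copied from the cars.fuel.value_counts() output above
fuel_counts = pd.Series(
    {"Diesel": 2153, "Petrol": 2123, "CNG": 40, "LPG": 23, "Electric": 1}
)

# Share of each category in the whole dataset
shares = fuel_counts / fuel_counts.sum()

# The two most frequent categories and the gap between them
top_two = shares.nlargest(2)
gap = top_two.iloc[0] - top_two.iloc[1]
print(top_two.round(3).to_dict())
print(round(gap, 4))
```

A gap of under one percentage point between the top two categories suggests the mode of `fuel` carries little information on its own.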

#### By `owner`

In [88]:
cars[continuousFeatures+['owner']].groupby('owner').agg(['mean', 'median'])

Unnamed: 0_level_0,km_driven,km_driven,year,year
Unnamed: 0_level_1,mean,median,mean,median
owner,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
First Owner,56015.009887,50000.0,2014.440678,2015.0
Fourth & Above Owner,99138.135802,90000.0,2007.395062,2008.0
Second Owner,81783.518987,80000.0,2010.983725,2012.0
Test Drive Car,4155.0,1010.0,2019.529412,2020.0
Third Owner,99304.506579,90000.0,2009.338816,2010.0


- `km_driven` is positively correlated with `owner`
- `year` is negatively correlated with `owner`

In [96]:
cars[['owner', 'selling_price']].groupby('owner').agg(['mean', 'median'])

Unnamed: 0_level_0,selling_price,selling_price
Unnamed: 0_level_1,mean,median
owner,Unnamed: 1_level_2,Unnamed: 2_level_2
First Owner,598636.969633,450000.0
Fourth & Above Owner,173901.197531,130000.0
Second Owner,343891.088608,250499.5
Test Drive Car,954293.941176,894999.0
Third Owner,269474.003289,190000.0


The `owner` field seems to affect the target variable significantly.

In [93]:
cars[nominalFeatures+['owner']].groupby('owner').agg(pd.Series.mode)

Unnamed: 0_level_0,brand,fuel,transmission,seller_type
owner,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
First Owner,Maruti,Diesel,Manual,Individual
Fourth & Above Owner,Maruti,Petrol,Manual,Individual
Second Owner,Maruti,Diesel,Manual,Individual
Test Drive Car,Ford,Petrol,Manual,Dealer
Third Owner,Maruti,Diesel,Manual,Individual


In [97]:
cars.owner.value_counts()

owner
First Owner             2832
Second Owner            1106
Third Owner              304
Fourth & Above Owner      81
Test Drive Car            17
Name: count, dtype: int64

**Overall Conclusions:**

- `km_driven` is positively correlated with `owner`
- `year` is negatively correlated with `owner`
- the target variable (`selling_price`) is noticeably negatively correlated with `owner`
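As a rough check of these monotonic relationships, the sketch below computes a Spearman rank correlation on the group medians reported above, ordered by the `ownerDict` encoding (Test Drive Car = 0 up to Fourth & Above Owner = 4). It works on five group-level points rather than on the individual rows, so it only indicates direction; `medians`, `owner_rank`, and `spearman` are illustrative names.

```python
import pandas as pd

# Group medians copied from the tables above, indexed by the ownerDict encoding
medians = pd.DataFrame(
    {
        "km_driven": [1010.0, 50000.0, 80000.0, 90000.0, 90000.0],
        "year": [2020.0, 2015.0, 2012.0, 2010.0, 2008.0],
        "selling_price": [894999.0, 450000.0, 250499.5, 190000.0, 130000.0],
    },
    index=pd.Index([0, 1, 2, 3, 4], name="owner_rank"),
)

owner_rank = medians.index.to_series().astype(float)

# Spearman correlation is the Pearson correlation of the ranks
ranked_owner = owner_rank.rank()
spearman = medians.rank().apply(lambda col: col.corr(ranked_owner))
print(spearman.round(3))
```

The sign of each value matches the conclusions above: positive for `km_driven`, negative for `year` and `selling_price`.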

#### By `fuel`

In [98]:
cars[continuousFeatures+['fuel']].groupby('fuel').agg(['mean', 'median'])

Unnamed: 0_level_0,km_driven,km_driven,year,year
Unnamed: 0_level_1,mean,median,mean,median
fuel,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
CNG,67234.75,71500.0,2013.475,2013.0
Diesel,79630.977706,72000.0,2013.606595,2014.0
Electric,62000.0,62000.0,2006.0,2006.0
LPG,89634.782609,90000.0,2010.130435,2011.0
Petrol,52340.079604,50000.0,2012.595855,2014.0


In [100]:
cars[['fuel', 'selling_price']].groupby('fuel').agg(['mean', 'median'])

Unnamed: 0_level_0,selling_price,selling_price
Unnamed: 0_level_1,mean,median
fuel,Unnamed: 1_level_2,Unnamed: 2_level_2
CNG,277174.925,247500.0
Diesel,669094.252206,500000.0
Electric,310000.0,310000.0
LPG,167826.043478,180000.0
Petrol,344840.137541,269000.0


Diesel cars are the most expensive, while LPG cars are the cheapest.

In [101]:
cars[nominalFeatures].groupby('fuel').agg(pd.Series.mode)

Unnamed: 0_level_0,brand,transmission,seller_type
fuel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CNG,Maruti,Manual,Individual
Diesel,Maruti,Manual,Individual
Electric,Toyota,Automatic,Dealer
LPG,Maruti,Manual,Individual
Petrol,Maruti,Manual,Individual


In [99]:
cars.fuel.value_counts()

fuel
Diesel      2153
Petrol      2123
CNG           40
LPG           23
Electric       1
Name: count, dtype: int64

**Overall Conclusions**

- Diesel cars are the most expensive, while LPG cars are the cheapest

#### By `transmission`

In [104]:
cars[continuousFeatures+['transmission']].groupby('transmission').agg(['mean', 'median'])

Unnamed: 0_level_0,km_driven,km_driven,year,year
Unnamed: 0_level_1,mean,median,mean,median
transmission,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Automatic,49688.803571,41210.0,2014.877232,2016.0
Manual,68118.162898,60000.0,2012.885149,2013.0


Automatic cars have lower `km_driven` than Manual cars (a mean of about 49,700 km against about 68,100 km).

In [105]:
cars[['transmission', 'selling_price']].groupby('transmission').agg(['mean', 'median'])

Unnamed: 0_level_0,selling_price,selling_price
Unnamed: 0_level_1,mean,median
transmission,Unnamed: 1_level_2,Unnamed: 2_level_2
Automatic,1408154.0,950000.0
Manual,400066.7,325000.0


Automatic cars are roughly three times as expensive as Manual cars.
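The size of the price gap depends on which statistic is used. A quick arithmetic sketch with the values from the table above (the dictionary names are illustrative):

```python
# Values copied from the groupby output above
automatic = {"mean": 1408154.0, "median": 950000.0}
manual = {"mean": 400066.7, "median": 325000.0}

# Ratio of Automatic to Manual prices for each statistic
ratios = {stat: automatic[stat] / manual[stat] for stat in automatic}
print({stat: round(value, 2) for stat, value in ratios.items()})
# {'mean': 3.52, 'median': 2.92}
```

Because `selling_price` is right skewed, the median ratio is probably the more representative of the two.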

In [106]:
cars[nominalFeatures].groupby('transmission').agg(pd.Series.mode)

Unnamed: 0_level_0,brand,fuel,seller_type
transmission,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Automatic,Hyundai,Diesel,Dealer
Manual,Maruti,Petrol,Individual


In [103]:
cars.transmission.value_counts()

transmission
Manual       3892
Automatic     448
Name: count, dtype: int64

**Overall Conclusions**

- Automatic cars have lower `km_driven` than Manual cars
- Automatic cars are roughly three times as expensive as Manual cars

#### By `seller_type`

In [108]:
cars[continuousFeatures+['seller_type']].groupby('seller_type').agg(['mean', 'median'])

Unnamed: 0_level_0,km_driven,km_driven,year,year
Unnamed: 0_level_1,mean,median,mean,median
seller_type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Dealer,52827.259557,49000.0,2014.200201,2015.0
Individual,71167.556104,70000.0,2012.665228,2013.0
Trustmark Dealer,39202.215686,46507.0,2015.813725,2016.0


In [109]:
cars[['seller_type', 'selling_price']].groupby('seller_type').agg(['mean', 'median'])

Unnamed: 0_level_0,selling_price,selling_price
Unnamed: 0_level_1,mean,median
seller_type,Unnamed: 1_level_2,Unnamed: 2_level_2
Dealer,721822.890342,495000.0
Individual,424505.419236,300000.0
Trustmark Dealer,914950.980392,750000.0


Cars sold by Trustmark Dealers and Dealers are more expensive than cars sold by Individuals.

In [110]:
cars[nominalFeatures].groupby('seller_type').agg(pd.Series.mode)

Unnamed: 0_level_0,brand,fuel,transmission
seller_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dealer,Hyundai,Diesel,Manual
Individual,Maruti,Petrol,Manual
Trustmark Dealer,Maruti,Petrol,Manual


In [111]:
cars.seller_type.value_counts()

seller_type
Individual          3244
Dealer               994
Trustmark Dealer     102
Name: count, dtype: int64

**Overall Conclusions**

- Cars sold by Trustmark Dealers and Dealers are more expensive than cars sold by Individuals
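Group statistics like the ones in this section are easier to judge when the group size is shown next to them, since small groups (such as the 102 Trustmark Dealer rows) give less stable estimates. A minimal sketch on hypothetical toy data; `toy` and `summary` are illustrative names and the prices are made up for the example:

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame(
    {
        "seller_type": [
            "Dealer", "Dealer",
            "Individual", "Individual", "Individual",
            "Trustmark Dealer",
        ],
        "selling_price": [500_000, 700_000, 200_000, 300_000, 400_000, 750_000],
    }
)

# Mean, median, and group size in a single table
summary = toy.groupby("seller_type")["selling_price"].agg(
    ["mean", "median", "count"]
)
print(summary)
```

On the real data the same pattern would be `cars.groupby('seller_type')['selling_price'].agg(['mean', 'median', 'count'])`, which would replace the separate `value_counts()` call.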

In [87]:
ordinalFeatures+nominalFeatures

['owner', 'brand', 'fuel', 'transmission', 'seller_type']

### Second Moment Business Decision

In [83]:
cars[continuousFeatures].agg(['var', 'std'])

Unnamed: 0,km_driven,year
var,2175672000.0,17.769125
std,46644.1,4.215344


In [84]:
cars[continuousFeatures].max()-cars[continuousFeatures].min()

km_driven    806598
year             28
dtype: int64

### Third Moment Business Decision

In [85]:
cars[continuousFeatures].skew()

km_driven    2.669057
year        -0.833240
dtype: float64

As mentioned before:
- `km_driven` -- right skewed
- `year` -- slightly left skewed
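For reference, `Series.skew()` in pandas returns the bias-adjusted sample skewness. The sketch below recomputes it from the central moments on hypothetical toy mileage values and compares the result with pandas; `adjusted_skewness` and `toy_km` are illustrative names.

```python
import numpy as np
import pandas as pd


def adjusted_skewness(values):
    """Bias-adjusted sample skewness computed from central moments."""
    x = np.asarray(values, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment
    m3 = np.mean(centered ** 3)  # third central moment
    g1 = m3 / m2 ** 1.5          # biased skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1


# Hypothetical toy values with one very large mileage (a long right tail)
toy_km = pd.Series([10_000, 20_000, 30_000, 40_000, 250_000])

manual = adjusted_skewness(toy_km)
print(manual, toy_km.skew())
```

A positive result corresponds to a right tail, as seen for `km_driven`; a negative result corresponds to a left tail, as seen for `year`.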

### Fourth Moment Business Decision

In [86]:
cars[continuousFeatures].kurt()

km_driven    23.316809
year          0.668263
dtype: float64

Both features have positive excess kurtosis (leptokurtic distribution), strongly for `km_driven` (23.3) and only mildly for `year` (0.67):
- Sharp Peak
- Thick Tails
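Note that `kurt()` in pandas reports excess kurtosis, so a normal distribution scores close to 0 rather than 3. The sketch below illustrates this on simulated data; the seed, the sample size, and the choice of a Laplace distribution as the heavy-tailed example are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Simulated samples (seed and size are arbitrary choices)
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=100_000))
heavy_sample = pd.Series(rng.laplace(size=100_000))  # heavier tails than normal

normal_kurt = normal_sample.kurt()  # expected to be close to 0
heavy_kurt = heavy_sample.kurt()    # expected to be clearly positive
print(round(normal_kurt, 2), round(heavy_kurt, 2))
```

Against this baseline, the value of 23.3 for `km_driven` points to very heavy tails, most likely a few cars with extreme mileage.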

### Graphical Representations

## Data Preprocessing

### Data Cleaning

### Data Wrangling

## Feature engineering


# 3. Model Building

# 4. Evaluation

# 5. Model Deployment

# 6. Monitoring and Maintenance 

# Resources

1. [Mind Map](https://360digitmg.com/mindmap/data-science)
2. [Kaggle information about dataset - Car Dekho](https://www.kaggle.com/datasets/akshaydattatraykhare/car-details-dataset)