## Importing the Libraries and the Dataset

In [1]:
import pandas as pd # For reading the data and manipulating it
import numpy as np # For mathematical computations.

import matplotlib.pyplot as plt # For plotting 
import seaborn as sns # For plotting 

In [2]:
cars = pd.read_excel('Data_Train.xlsx') # reading the training file into cars dataframe

cars.sample(5) # viewing 5 random samples of the data

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
2535,Maruti Swift Dzire Vdi BSIV,Hyderabad,2016,69000,Diesel,Manual,First,19.3 kmpl,1248 CC,73.9 bhp,5.0,,4.79
4606,Maruti Zen LXi - BS III,Hyderabad,2006,80000,Petrol,Manual,Second,17.3 kmpl,993 CC,60 bhp,5.0,,1.2
5075,Mercedes-Benz New C-Class C 200 Kompressor Ele...,Kolkata,2007,43627,Petrol,Automatic,First,11.74 kmpl,1796 CC,186 bhp,5.0,,29.0
5352,Maruti Ciaz VXi Plus,Mumbai,2015,41000,Petrol,Manual,First,20.73 kmpl,1373 CC,91.1 bhp,5.0,,5.85
3322,Honda Amaze VX AT i-Vtech,Delhi,2016,19000,Petrol,Automatic,First,15.5 kmpl,1198 CC,86.7 bhp,5.0,,6.25


## Preparing the Dataset for Analysis

#### Info about the `cars` dataframe

In [3]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 13 columns):
Name                 6019 non-null object
Location             6019 non-null object
Year                 6019 non-null int64
Kilometers_Driven    6019 non-null int64
Fuel_Type            6019 non-null object
Transmission         6019 non-null object
Owner_Type           6019 non-null object
Mileage              6017 non-null object
Engine               5983 non-null object
Power                5983 non-null object
Seats                5977 non-null float64
New_Price            824 non-null object
Price                6019 non-null float64
dtypes: float64(2), int64(2), object(9)
memory usage: 611.4+ KB


* Our `cars` data has **6019** rows and **13** columns. Right now it looks like:
* __Mileage__ has **2** missing values.
* __Engine__ and __Power__ each have **36** missing values.
* __Seats__ has **42** missing values.
* __New_Price__ has **5195** missing values, which means that about **86%** of rows lack it. We will therefore drop it later.
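
The counts above were read off `cars.info()` by subtracting the non-null counts from 6019. As a minimal sketch, the same figures can be computed directly. The `toy` frame below is made-up data standing in for `cars`, and the names `missing`, `missing_pct`, and `summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `cars`; the values are made up for illustration.
toy = pd.DataFrame({
    "Mileage": ["19.3 kmpl", None, "17.3 kmpl", "20.7 kmpl"],
    "Seats": [5.0, 5.0, np.nan, np.nan],
    "New_Price": [None, None, None, "8.34 Lakh"],
})

missing = toy.isna().sum()                         # missing values per column
missing_pct = (toy.isna().mean() * 100).round(1)   # share of rows missing, in percent

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```

Running the same two expressions on `cars` should avoid arithmetic slips when the counts are derived by hand.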

#### Name

In [4]:
print('No. of unique entries : ', len(cars.Name.unique()))

No. of unique entries :  1876


The name of each car follows the pattern `[Brand Name] [Car Name]`.    
I expect cars from certain brands to have a higher resale value than others,
so I am creating a new column `Brand` from the `Name` column.

In [5]:
cars['Brand'] = [name.split(' ')[0].capitalize() for name in cars.Name]

cars.sample(3)

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price,Brand
290,Tata Zest Quadrajet 1.3 75PS XE,Pune,2018,45000,Diesel,Manual,First,22.95 kmpl,1248 CC,74 bhp,5.0,8.34 Lakh,5.25,Tata
1734,Maruti Swift Dzire VDI,Kolkata,2014,32000,Diesel,Manual,First,23.4 kmpl,1248 CC,74 bhp,5.0,,4.35,Maruti
5186,Volkswagen Vento Petrol Highline AT,Mumbai,2011,45000,Petrol,Automatic,First,14.4 kmpl,1598 CC,103.6 bhp,5.0,,3.35,Volkswagen


#### Brand

Available brands of cars in the dataset

In [6]:
print(cars.Brand.value_counts().index.values)

['Maruti' 'Hyundai' 'Honda' 'Toyota' 'Mercedes-benz' 'Volkswagen' 'Ford'
 'Mahindra' 'Bmw' 'Audi' 'Tata' 'Skoda' 'Renault' 'Chevrolet' 'Nissan'
 'Land' 'Jaguar' 'Fiat' 'Mitsubishi' 'Mini' 'Volvo' 'Porsche' 'Jeep'
 'Datsun' 'Force' 'Isuzu' 'Ambassador' 'Bentley' 'Lamborghini' 'Smart']


Splitting on the first space truncated `Land Rover` to `Land`, so I am replacing `Land` with `Land Rover`.

In [7]:
cars.Brand = cars.Brand.replace('Land', 'Land Rover') # assigning back instead of inplace=True on a column, which newer pandas versions warn about

print(cars.Brand.value_counts().index.values)

['Maruti' 'Hyundai' 'Honda' 'Toyota' 'Mercedes-benz' 'Volkswagen' 'Ford'
 'Mahindra' 'Bmw' 'Audi' 'Tata' 'Skoda' 'Renault' 'Chevrolet' 'Nissan'
 'Land Rover' 'Jaguar' 'Fiat' 'Mitsubishi' 'Mini' 'Volvo' 'Porsche' 'Jeep'
 'Datsun' 'Force' 'Isuzu' 'Ambassador' 'Bentley' 'Lamborghini' 'Smart']
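
The brand extraction and the `Land` fix-up can also be written with vectorised string methods. This is only a sketch on a made-up `names` series, not the notebook's own code. It assumes, as above, that the brand is the first whitespace-separated token.

```python
import pandas as pd

# Made-up sample of car names in the same "[Brand Name] [Car Name]" pattern.
names = pd.Series([
    "Maruti Swift Dzire VDI",
    "Land Rover Discovery Sport SD4 HSE Luxury 7S",
    "Mercedes-Benz New C-Class C 200",
])

# First token of each name, capitalised the same way as str.capitalize() does.
brand = names.str.split().str[0].str.capitalize()

# Repair the one two-word brand that the split truncates.
brand = brand.replace({"Land": "Land Rover"})
print(brand.tolist())
```

Note that `capitalize` lowercases everything after the first letter, which is why `Mercedes-Benz` and `BMW` show up as `Mercedes-benz` and `Bmw` in the output above.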


#### Location

In [8]:
print(cars.Location.value_counts().index.values)

['Mumbai' 'Hyderabad' 'Kochi' 'Coimbatore' 'Pune' 'Delhi' 'Kolkata'
 'Chennai' 'Jaipur' 'Bangalore' 'Ahmedabad']


All good here!

#### Year

In [9]:
print(cars.Year.value_counts().index.values)

[2014 2015 2016 2013 2017 2012 2011 2010 2018 2009 2008 2007 2019 2006
 2005 2004 2003 2002 2001 1998 2000 1999]


The oldest car is from **1998** and the newest is from **2019**, so the dataset spans 21 years.
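
Because `value_counts()` orders years by frequency rather than by value, the extremes are easy to misread from the printed list. A small sketch of reading them directly, using a made-up `years` series in place of `cars.Year`:

```python
import pandas as pd

# Made-up years, deliberately unsorted like the value_counts() index above.
years = pd.Series([2014, 1998, 2019, 2006, 1999])

oldest = years.min()
newest = years.max()
span = newest - oldest   # number of years between the oldest and newest car

print(oldest, newest, span)
```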

#### Kilometers_Driven

I will first rename the column from **Kilometers_Driven** to **Km** for brevity.

In [10]:
cars.rename(columns = {
    'Kilometers_Driven' : 'Km',
}, inplace = True)

In [11]:
cars.Km.describe()

count    6.019000e+03
mean     5.873838e+04
std      9.126884e+04
min      1.710000e+02
25%      3.400000e+04
50%      5.300000e+04
75%      7.300000e+04
max      6.500000e+06
Name: Km, dtype: float64

#### Fuel_Type

I will first rename the column from **Fuel_Type** to **Fuel** for brevity.

In [12]:
cars.rename(columns = {
    'Fuel_Type' : 'Fuel',
}, inplace = True)

In [13]:
cars.Fuel.value_counts()

Diesel      3205
Petrol      2746
CNG           56
LPG           10
Electric       2
Name: Fuel, dtype: int64

#### Transmission

In [14]:
cars.Transmission.value_counts()

Manual       4299
Automatic    1720
Name: Transmission, dtype: int64

Nothing to change here!

#### Owner_Type

In [15]:
cars.rename(columns = {
    'Owner_Type' : 'Owner'
}, inplace = True)

In [16]:
cars.Owner.value_counts()

First             4929
Second             968
Third              113
Fourth & Above       9
Name: Owner, dtype: int64

In [17]:
# Encoding ownership as an ordinal number; assigning back instead of using inplace=True on a column
cars.Owner = cars.Owner.replace({
    'First' : 1,
    'Second' : 2,
    'Third' : 3,
    'Fourth & Above' : 4,
})

cars.sample(3)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
4354,Toyota Etios Liva GD,Delhi,2012,72351,Diesel,Manual,1,23.59 kmpl,1364 CC,null bhp,5.0,,2.65,Toyota
5191,Ford Endeavour 2.2 Titanium AT 4X2,Chennai,2019,9000,Diesel,Automatic,1,12.62 kmpl,2198 CC,158 bhp,7.0,,32.9,Ford
2439,Honda City 1.5 S MT,Kolkata,2010,38446,Petrol,Manual,1,17.0 kmpl,1497 CC,118 bhp,5.0,,2.59,Honda


In [18]:
cars.Owner.value_counts()

1    4929
2     968
3     113
4       9
Name: Owner, dtype: int64
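
The ordinal encoding of `Owner` can also be sketched with `Series.map`, which turns any label missing from the dictionary into `NaN` instead of silently leaving it as text. The `owner` series and the name `owner_rank` below are illustrative, not part of the notebook.

```python
import pandas as pd

# Made-up sample of the four ownership labels seen in the dataset.
owner = pd.Series(["First", "Second", "Third", "Fourth & Above", "First"])

owner_rank = {"First": 1, "Second": 2, "Third": 3, "Fourth & Above": 4}
encoded = owner.map(owner_rank)

# Labels absent from the mapping would show up here as NaN.
unmapped = int(encoded.isna().sum())
print(encoded.tolist(), unmapped)
```

Checking `unmapped` before moving on is a cheap way to catch a misspelt or unexpected category.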

#### Mileage

In [19]:
cars.Mileage = cars.Mileage.str.strip('kmpl km/kg') # strip() takes a set of characters: trims k, m, p, l, g, '/' and spaces from both ends

In [20]:
cars.Mileage = pd.to_numeric(cars.Mileage) # converting the whole column to float in one call

cars.dtypes

Name             object
Location         object
Year              int64
Km                int64
Fuel             object
Transmission     object
Owner             int64
Mileage         float64
Engine           object
Power            object
Seats           float64
New_Price        object
Price           float64
Brand            object
dtype: object
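
The unit stripping above works because `str.strip` removes any of the listed characters from both ends. A more explicit alternative, sketched here on a made-up `mileage_raw` series, is to capture the leading number with a regular expression and ignore whatever unit follows. The pattern is an assumption about the value format (`<number> <unit>`), based on the samples shown earlier.

```python
import pandas as pd

# Made-up raw values in the two unit styles that appear in the column.
mileage_raw = pd.Series(["19.3 kmpl", "26.6 km/kg", None, "0.0 kmpl"])

# Capture the leading integer or decimal number; the unit text is discarded.
number = mileage_raw.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)

# Missing or unparseable entries become NaN rather than raising an error.
mileage = pd.to_numeric(number, errors="coerce")
print(mileage.tolist())
```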

In [21]:
cars.Mileage.describe()

count    6017.000000
mean       18.134961
std         4.582289
min         0.000000
25%        15.170000
50%        18.150000
75%        21.100000
max        33.540000
Name: Mileage, dtype: float64

In [22]:
len(cars[cars.Mileage == 0])

68

In [23]:
# Replacing zero and missing mileage with the column mean (note: this mean still includes the zeros)
cars.Mileage = cars.Mileage.replace([0, np.nan], cars.Mileage.mean())
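
In cell 23 the mean is computed while the 68 zero readings are still in the column, so the fill value is pulled slightly downwards. A possible variation, shown on made-up numbers and not what the notebook does, is to mark zeros as missing first and then fill with the mean of the valid readings only.

```python
import numpy as np
import pandas as pd

# Made-up mileage values: one zero placeholder and one genuinely missing entry.
mileage = pd.Series([20.0, 0.0, 10.0, np.nan])

# Treat 0 as "not recorded" so it does not drag the mean down.
cleaned = mileage.replace(0, np.nan)

fill_value = cleaned.mean()            # mean of the valid readings only
imputed = cleaned.fillna(fill_value)

print(fill_value, imputed.tolist())
```

On this toy series the fill value is 15.0, whereas a mean that includes the zero would be 10.0.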

#### Engine

In [24]:
cars.Engine = cars.Engine.str.strip('CC') # removing the trailing 'CC' unit
cars.Engine = pd.to_numeric(cars.Engine) # converting to float

In [25]:
cars.Engine.describe()

count    5983.000000
mean     1621.276450
std       601.355233
min        72.000000
25%      1198.000000
50%      1493.000000
75%      1984.000000
max      5998.000000
Name: Engine, dtype: float64

In [28]:
cars.head(3)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,1,26.6,998.0,58.16 bhp,5.0,,1.75,Maruti
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,1,19.67,1582.0,126.2 bhp,5.0,,12.5,Hyundai
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,1,18.2,1199.0,88.7 bhp,5.0,8.61 Lakh,4.5,Honda


#### Power

In [29]:
cars.Power.describe()

count       5983
unique       372
top       74 bhp
freq         235
Name: Power, dtype: object

In [30]:
cars.Power = cars.Power.astype(str) # converting to string
cars.Power = [power[: -4] for power in cars.Power] # removing bhp
cars.Power = pd.to_numeric(cars.Power, errors='coerce') # converting to float
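
Some `Power` entries are the literal text `null bhp` (one appears in the sample printed after cell 17), which is why `errors='coerce'` is needed here. The sketch below, on a made-up `power_raw` series, shows how such entries end up as `NaN`. It removes the unit with a regular expression instead of slicing off the last four characters, which is a substitution of mine rather than the notebook's approach.

```python
import pandas as pd

# Made-up raw values, including the "null bhp" placeholder and a true missing value.
power_raw = pd.Series(["74 bhp", "null bhp", None, "103.6 bhp"])

# Remove a trailing "bhp" unit together with the whitespace before it.
stripped = power_raw.str.replace(r"\s*bhp$", "", regex=True)

# "null" cannot be parsed as a number, so coerce turns it into NaN.
power = pd.to_numeric(stripped, errors="coerce")
print(power.tolist())
```

After this step the hidden `null bhp` placeholders are counted as missing values, so the number of missing `Power` entries can be higher than the 36 reported by `cars.info()`.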

##### Dropping rows with missing values in Power and Engine

In [38]:
cars.dropna(subset=['Power', 'Engine'], inplace = True)
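
A quick way to see what `dropna(subset=...)` removes is to compare row counts before and after. The following is a sketch on a made-up `toy` frame. Only the columns named in `subset` decide whether a row is dropped, so the missing `Seats` value alone would not remove a row.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 lacks Engine, row 2 lacks Power and Seats.
toy = pd.DataFrame({
    "Engine": [998.0, np.nan, 1199.0, 1248.0],
    "Power": [58.16, 126.2, np.nan, 88.76],
    "Seats": [5.0, 5.0, np.nan, 7.0],
})

before = len(toy)
toy = toy.dropna(subset=["Power", "Engine"])   # Seats is not checked here
dropped = before - len(toy)

print(dropped, toy.index.tolist())
```

The surviving rows keep their original index labels, which is why gaps appear in the index after a drop like this.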

In [41]:
cars.head(5)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,1,26.6,998.0,58.16,5.0,,1.75,Maruti
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,1,19.67,1582.0,126.2,5.0,,12.5,Hyundai
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,1,18.2,1199.0,88.7,5.0,8.61 Lakh,4.5,Honda
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,1,20.77,1248.0,88.76,7.0,,6.0,Maruti
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,2,15.2,1968.0,140.8,5.0,,17.74,Audi


#### New_Price

As mentioned earlier, `New_Price` is missing in about 86% of rows (5195 of 6019), so I will drop this column from the dataset.

In [43]:
cars.drop(columns = 'New_Price', inplace = True)
cars.sample(3)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,Price,Brand
5067,Land Rover Discovery Sport SD4 HSE Luxury 7S,Coimbatore,2019,17201,Diesel,Automatic,1,12.51,2179.0,187.7,7.0,58.91,Land Rover
1213,Ford Figo Diesel Titanium,Mumbai,2014,54000,Diesel,Manual,1,20.0,1399.0,68.05,5.0,3.25,Ford
2373,Volkswagen Polo Diesel Comfortline 1.2L,Delhi,2012,74000,Diesel,Manual,2,22.07,1199.0,73.9,5.0,2.65,Volkswagen


In [44]:
cars.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,5876.0,2013.476515,3.165822,1998.0,2012.0,2014.0,2016.0,2019.0
Km,5876.0,58320.261232,92139.230505,171.0,33443.75,52609.0,72402.75,6500000.0
Owner,5876.0,1.195541,0.445941,1.0,1.0,1.0,1.0,4.0
Mileage,5876.0,18.363283,4.177478,6.4,15.3,18.2,21.1,33.54
Engine,5876.0,1625.466133,601.787379,72.0,1198.0,1495.5,1991.0,5998.0
Power,5876.0,113.25305,53.874957,34.2,75.0,97.7,138.1,560.0
Seats,5874.0,5.283623,0.804961,2.0,5.0,5.0,5.0,10.0
Price,5876.0,9.602665,11.246531,0.44,3.5175,5.75,10.0125,160.0
