## Importing the Library and the Dataset

In [1]:
import pandas as pd # For reading the data and manipulating it
import numpy as np # For mathematical computations.

import matplotlib.pyplot as plt # For plotting 
import seaborn as sns # For plotting 

In [2]:
cars = pd.read_excel('Data_Train.xlsx') # reading the training file into cars dataframe

cars.sample(5) # viewing 5 random samples of the data

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
2830,Chevrolet Sail Hatchback LS ABS,Kolkata,2013,45000,Diesel,Manual,First,22.1 kmpl,1248 CC,76.9 bhp,5.0,,3.0
2431,Toyota Fortuner 3.0 Diesel,Mumbai,2010,122000,Diesel,Manual,Second,11.5 kmpl,2982 CC,171 bhp,7.0,,10.45
3507,Mitsubishi Pajero 4X4 LHD,Bangalore,2012,155566,Diesel,Manual,Second,9.5 kmpl,2835 CC,118.6 bhp,6.0,,9.15
1166,Hyundai Xcent 1.2 Kappa SX,Kolkata,2016,32000,Petrol,Manual,First,19.1 kmpl,1197 CC,82 bhp,5.0,,3.6
1714,Ford Figo Diesel EXI,Ahmedabad,2012,55005,Diesel,Manual,First,20.0 kmpl,1399 CC,68 bhp,5.0,,2.75


## Preparing the Dataset for Analysis

#### Info about the `cars` dataframe

In [3]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 13 columns):
Name                 6019 non-null object
Location             6019 non-null object
Year                 6019 non-null int64
Kilometers_Driven    6019 non-null int64
Fuel_Type            6019 non-null object
Transmission         6019 non-null object
Owner_Type           6019 non-null object
Mileage              6017 non-null object
Engine               5983 non-null object
Power                5983 non-null object
Seats                5977 non-null float64
New_Price            824 non-null object
Price                6019 non-null float64
dtypes: float64(2), int64(2), object(9)
memory usage: 611.4+ KB


Our `cars` data has **6019** rows and **13** columns. Subtracting each non-null count above from 6019 gives:

* __Mileage__ has **2** missing values.
* __Engine__ and __Power__ each have **36** missing values.
* __Seats__ has **42** missing values.
* __New_Price__ has **5195** missing values, so about **86%** of rows lack it. We will therefore drop it later.
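
The same missing-value figures can be computed rather than read off `cars.info()` by hand. This is a minimal sketch, assuming a small stand-in frame called `demo` (hypothetical values, not rows from the real file) in place of `cars`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `cars`; the values are made up for illustration
demo = pd.DataFrame({
    'Mileage': ['22.1 kmpl', None, '9.5 kmpl', '19.1 kmpl'],
    'Seats': [5.0, 7.0, np.nan, np.nan],
    'New_Price': [None, None, None, '8.61 Lakh'],
    'Price': [3.0, 10.45, 9.15, 3.6],
})

# Missing values per column, as a count and as a share of all rows
missing = demo.isnull().sum()
missing_pct = (missing / len(demo) * 100).round(1)

summary = pd.DataFrame({'missing': missing, 'percent': missing_pct})

# Keep only columns that actually have gaps, worst first
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)
print(summary)
```

On the real data the same lines would run against `cars` instead of `demo`.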

#### Name

In [4]:
print('No. of unique entries : ', len(cars.Name.unique()))

No. of unique entries :  1876


Each car name follows the pattern `[Brand Name] [Car Name]`.
I also expect cars from certain brands to hold a higher resale value than others,
so I am creating a new column `Brand` from the `Name` column.

In [5]:
cars['Brand'] = [name.split(' ')[0].capitalize() for name in cars.Name]

cars.sample(3)

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price,Brand
2315,Maruti Swift VXI BSIV,Kolkata,2017,13000,Petrol,Manual,First,20.4 kmpl,1197 CC,81.80 bhp,5.0,,4.65,Maruti
55,Volkswagen Vento 2013-2015 1.6 Comfortline,Kolkata,2015,39000,Petrol,Manual,First,15.04 kmpl,1598 CC,103.2 bhp,5.0,,3.99,Volkswagen
2382,Mercedes-Benz M-Class ML 250 CDI,Coimbatore,2016,23218,Diesel,Automatic,First,15.26 kmpl,2143 CC,203.2 bhp,5.0,,45.61,Mercedes-benz


#### Brand

Available brands of cars in the dataset

In [6]:
print(cars.Brand.value_counts().index.values)

['Maruti' 'Hyundai' 'Honda' 'Toyota' 'Mercedes-benz' 'Volkswagen' 'Ford'
 'Mahindra' 'Bmw' 'Audi' 'Tata' 'Skoda' 'Renault' 'Chevrolet' 'Nissan'
 'Land' 'Jaguar' 'Fiat' 'Mitsubishi' 'Mini' 'Volvo' 'Porsche' 'Jeep'
 'Datsun' 'Force' 'Isuzu' 'Bentley' 'Ambassador' 'Smart' 'Lamborghini']


Replacing `Land` with `Land Rover`, since taking only the first word truncated this two-word brand name

In [7]:
cars['Brand'] = cars.Brand.replace('Land', 'Land Rover') # assigning back, since inplace on a selected column may not update cars in newer pandas

print(cars.Brand.value_counts().index.values)

['Maruti' 'Hyundai' 'Honda' 'Toyota' 'Mercedes-benz' 'Volkswagen' 'Ford'
 'Mahindra' 'Bmw' 'Audi' 'Tata' 'Skoda' 'Renault' 'Chevrolet' 'Nissan'
 'Land Rover' 'Jaguar' 'Fiat' 'Mitsubishi' 'Mini' 'Volvo' 'Porsche' 'Jeep'
 'Datsun' 'Force' 'Isuzu' 'Bentley' 'Ambassador' 'Smart' 'Lamborghini']
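
The brand extraction and the `Land Rover` fix can also be written with pandas string methods instead of a list comprehension. This is a minimal sketch on a hypothetical `names` Series (the second entry is an invented example of a two-word brand), and the replacement mapping is an assumption that would need extending if other multi-word brands appear:

```python
import pandas as pd

# Hypothetical sample names following the "[Brand Name] [Car Name]" pattern
names = pd.Series([
    'Maruti Swift VXI BSIV',
    'Land Rover Range Rover 2.2L Pure',
    'Mercedes-Benz M-Class ML 250 CDI',
    'BMW 5 Series 2013-2017 530d M Sport',
])

# First whitespace-separated token, capitalized the same way as in the notebook
brand = names.str.split().str[0].str.capitalize()

# Restore brands whose names span two words (assumed mapping)
brand = brand.replace({'Land': 'Land Rover'})

print(brand.tolist())
```

Note that `capitalize()` lowercases everything after the first letter, which is why `BMW` and `Mercedes-Benz` show up as `Bmw` and `Mercedes-benz` in the output above.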


#### Location

In [8]:
print(cars.Location.value_counts().index.values)

['Mumbai' 'Hyderabad' 'Kochi' 'Coimbatore' 'Pune' 'Delhi' 'Kolkata'
 'Chennai' 'Jaipur' 'Bangalore' 'Ahmedabad']


All good here!

#### Year

In [9]:
print(cars.Year.value_counts().index.values)

[2014 2015 2016 2013 2017 2012 2011 2010 2018 2009 2008 2007 2019 2006
 2005 2004 2003 2002 2001 1998 2000 1999]


The oldest car is from **1998** and the newest is from **2019**, so the dataset covers 22 model years.

#### Kilometers_Driven

I will first rename the column from **Kilometers_Driven** to **Km** for convenience.

In [10]:
cars.rename(columns = {
    'Kilometers_Driven' : 'Km',
}, inplace = True)

In [11]:
cars.Km.describe()

count    6.019000e+03
mean     5.873838e+04
std      9.126884e+04
min      1.710000e+02
25%      3.400000e+04
50%      5.300000e+04
75%      7.300000e+04
max      6.500000e+06
Name: Km, dtype: float64
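
The summary above shows a maximum of 6.5 million km against a 75th percentile of 73,000 km, which looks like a data-entry error. One common way to flag such rows is the 1.5 × IQR rule. The sketch below is only an illustration under that assumption, using a hypothetical `km` Series built from the five sample rows shown earlier plus the reported maximum:

```python
import pandas as pd

# Five values from the sample rows shown earlier, plus the reported maximum
km = pd.Series([45000, 122000, 155566, 32000, 55005, 6500000], name='Km')

# Interquartile range and the conventional upper fence
q1, q3 = km.quantile(0.25), km.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr

# Rows above the fence are candidates for manual inspection
outliers = km[km > upper]
print(outliers)
```

Whether flagged rows should be dropped, capped, or corrected is a judgment call that depends on the later modelling step.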

#### Fuel_Type

I will first rename the column from **Fuel_Type** to **Fuel** for convenience.

In [12]:
cars.rename(columns = {
    'Fuel_Type' : 'Fuel',
}, inplace = True)

In [13]:
cars.Fuel.value_counts()

Diesel      3205
Petrol      2746
CNG           56
LPG           10
Electric       2
Name: Fuel, dtype: int64

#### Transmission

In [14]:
cars.Transmission.value_counts()

Manual       4299
Automatic    1720
Name: Transmission, dtype: int64

Nothing to change here!

#### Owner_Type

In [15]:
cars.rename(columns = {
    'Owner_Type' : 'Owner'
}, inplace = True)

In [16]:
cars.Owner.value_counts()

First             4929
Second             968
Third              113
Fourth & Above       9
Name: Owner, dtype: int64

In [17]:
cars['Owner'] = cars.Owner.map({ # mapping each ownership label to an ordinal number and assigning it back
    'First' : 1,
    'Second' : 2,
    'Third' : 3,
    'Fourth & Above' : 4,
})

cars.sample(3)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
3590,BMW 5 Series 2013-2017 530d M Sport,Kochi,2015,65266,Diesel,Automatic,1,14.69 kmpl,2993 CC,258 bhp,5.0,,31.75,Bmw
3680,Skoda Superb 1.8 TSI MT,Kolkata,2011,54895,Petrol,Manual,1,13.14 kmpl,1798 CC,160 bhp,5.0,,7.5,Skoda
2903,Hyundai EON Era Plus,Pune,2013,22000,Petrol,Manual,1,21.1 kmpl,814 CC,55.2 bhp,5.0,,2.6,Hyundai


In [18]:
cars.Owner.value_counts()

1    4929
2     968
3     113
4       9
Name: Owner, dtype: int64

#### Mileage

In [34]:
cars.Mileage = cars.Mileage.str.strip('kmpl km/kg')

In [36]:
cars.Mileage = pd.to_numeric(cars.Mileage) # converting the whole column at once

cars.dtypes

Name             object
Location         object
Year              int64
Km                int64
Fuel             object
Transmission     object
Owner             int64
Mileage         float64
Engine           object
Power            object
Seats           float64
New_Price        object
Price           float64
Brand            object
dtype: object
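
The Mileage cleanup above works because `str.strip` treats its argument as a set of characters to remove from both ends, not as a literal suffix. The sketch below shows this on a hypothetical `raw` Series (the `km/kg` unit is assumed from the strip argument) and, as an alternative, pulls out the leading number with a regular expression:

```python
import pandas as pd

# Hypothetical raw values in the two unit styles, plus one missing entry
raw = pd.Series(['22.1 kmpl', '26.6 km/kg', None])

# str.strip removes any leading/trailing characters found in the given set
stripped = raw.str.strip('kmpl km/kg')

# Alternative: capture the leading number explicitly
extracted = raw.str.extract(r'^(\d+(?:\.\d+)?)', expand=False)

# Both routes end in a numeric column; missing entries become NaN
mileage = pd.to_numeric(extracted)
print(stripped.iloc[0], stripped.iloc[1])
print(mileage.tolist())
```

The regular expression is stricter: a value with an unexpected unit still yields its number, whereas the character-set strip would leave the leftover text in place and make `pd.to_numeric` fail.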

In [42]:
cars.Mileage.describe()

count    6017.000000
mean       18.134961
std         4.582289
min         0.000000
25%        15.170000
50%        18.150000
75%        21.100000
max        33.540000
Name: Mileage, dtype: float64

In [41]:
len(cars[cars.Mileage == 0])

68

In [48]:
cars['Mileage'] = cars.Mileage.replace(0, np.nan).fillna(cars.Mileage.mean()) # zeros and missing values both become the column mean
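
One caveat with the cell above: the mean used for filling is computed while the 68 zero entries are still in the column, so it is pulled down slightly (18.13 instead of roughly 18.34 if the zeros are left out). A minimal sketch of the alternative, assuming a hypothetical `mileage` Series, treats zeros as missing before taking the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical mileage values with one zero and one missing entry
mileage = pd.Series([22.1, 0.0, 11.5, np.nan, 19.1])

# Treat 0 as missing first, so it does not drag the mean down
cleaned = mileage.replace(0, np.nan)

# Mean of the valid values only (22.1, 11.5 and 19.1)
fill_value = cleaned.mean()

# Fill zeros and original gaps with that mean
cleaned = cleaned.fillna(fill_value)
print(round(fill_value, 2))
```

On the notebook's data this would mean calling `replace(0, np.nan)` on `cars.Mileage` before computing the mean.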

#### Engine

In [53]:
cars.Engine = cars.Engine.str.strip('CC')
cars.Engine = cars.Engine.apply(pd.to_numeric)

In [58]:
cars.Engine.describe()

count    5983.000000
mean     1621.276450
std       601.355233
min        72.000000
25%      1198.000000
50%      1493.000000
75%      1984.000000
max      5998.000000
Name: Engine, dtype: float64

In [61]:
cars['Engine'] = cars.Engine.fillna(1984) # 1984 is the 75th percentile reported by describe() above

In [62]:
cars.Engine.describe()

count    6019.000000
mean     1623.445921
std       600.205947
min        72.000000
25%      1198.000000
50%      1493.000000
75%      1984.000000
max      5998.000000
Name: Engine, dtype: float64
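
The fill value 1984 matches the 75th percentile in the first `describe()` output. Instead of hard-coding it, the statistic can be computed from the column, which keeps the cell correct if the data changes. This is a minimal sketch on a hypothetical `engine` Series; the median is shown alongside only as a common alternative, not as the notebook's choice:

```python
import numpy as np
import pandas as pd

# Hypothetical engine displacements (in CC) with one missing entry
engine = pd.Series([1248.0, 2982.0, np.nan, 1197.0, 1399.0])

# Same statistic as the hard-coded value: the 75th percentile of valid entries
fill_value = engine.quantile(0.75)

# The median, for comparison
median_value = engine.median()

filled = engine.fillna(fill_value)
print(fill_value, median_value)
```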