## Importing the Libraries and the Dataset

In [1]:
import pandas as pd # For reading the data and manipulating it
import numpy as np # For mathematical computations.

import matplotlib.pyplot as plt # For plotting 
import seaborn as sns # For plotting 

In [2]:
cars = pd.read_excel('Data_Train.xlsx') # reading the training file into cars dataframe

cars.sample(5) # viewing 5 random samples of the data

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
1993,Ford Figo Petrol ZXI,Bangalore,2011,68465,Petrol,Manual,First,15.6 kmpl,1196 CC,70 bhp,5.0,,2.95
4943,Skoda Rapid 1.6 MPI AT Style,Kochi,2018,21488,Petrol,Automatic,First,14.3 kmpl,1598 CC,103.52 bhp,5.0,,9.72
2978,Porsche Panamera 2010 2013 4S,Coimbatore,2010,42400,Petrol,Automatic,Third,8.0 kmpl,4806 CC,394.3 bhp,4.0,,42.91
2641,Hyundai i20 Asta 1.2,Coimbatore,2016,17560,Petrol,Manual,First,18.6 kmpl,1197 CC,81.83 bhp,5.0,,7.21
5490,Renault Fluence Diesel E4,Pune,2013,57000,Diesel,Manual,First,21.8 kmpl,1461 CC,78 bhp,5.0,,4.61


## Preparing the Dataset for Analysis

#### Info about the `cars` dataframe

In [3]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 13 columns):
Name                 6019 non-null object
Location             6019 non-null object
Year                 6019 non-null int64
Kilometers_Driven    6019 non-null int64
Fuel_Type            6019 non-null object
Transmission         6019 non-null object
Owner_Type           6019 non-null object
Mileage              6017 non-null object
Engine               5983 non-null object
Power                5983 non-null object
Seats                5977 non-null float64
New_Price            824 non-null object
Price                6019 non-null float64
dtypes: float64(2), int64(2), object(9)
memory usage: 611.4+ KB


* Our `cars` data has **6019** rows and **13** columns. Based on the non-null counts above:
* __Mileage__ has **2** missing values.
* __Engine__ and __Power__ each have **36** missing values.
* __Seats__ has **42** missing values.
* __New_Price__ has **5195** missing values, so about **86%** of rows lack it. We will therefore drop it later.
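
The counts above can also be computed directly instead of being read off `info()`. The following is a minimal sketch on a toy frame; `missing_summary` and `demo` are hypothetical names used only for illustration, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (100 * missing / len(df)).round(1),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for `cars` (illustrative values only).
demo = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "New_Price": [np.nan, np.nan, np.nan, "8.61 Lakh"],
    "Price": [2.95, 9.72, 42.91, 7.21],
})
print(missing_summary(demo))
```

Calling `missing_summary(cars)` on the real data should list the same columns as the bullets above, assuming the file loads as shown.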

#### Name

In [4]:
print('No. of unique entries : ', len(cars.Name.unique()))

No. of unique entries :  1876


The name of each car follows the pattern [Brand Name] [Car Name].    
I also expect cars from certain brands to have a higher resale value than others. 
So, I am creating a new column `Brand` from the `Name` column.

In [5]:
cars['Brand'] = [name.split(' ')[0].capitalize() for name in cars.Name] # Extracting Brand Name from car name.

cars.sample(3)

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price,Brand
5229,BMW 3 Series 320d,Bangalore,2012,73000,Diesel,Automatic,First,22.69 kmpl,1995 CC,190 bhp,5.0,48.79 Lakh,12.5,Bmw
3103,Nissan Micra XL,Kolkata,2013,45000,Petrol,Manual,First,18.44 kmpl,1198 CC,75 bhp,5.0,,1.75,Nissan
5043,Volkswagen Polo 1.2 MPI Comfortline,Ahmedabad,2014,44000,Petrol,Manual,First,16.47 kmpl,1198 CC,74 bhp,5.0,,4.3,Volkswagen


#### Brand

Available brands of cars in the dataset

In [6]:
print(cars.Brand.value_counts().index.values) # List of brands available

['Maruti' 'Hyundai' 'Honda' 'Toyota' 'Mercedes-benz' 'Volkswagen' 'Ford'
 'Mahindra' 'Bmw' 'Audi' 'Tata' 'Skoda' 'Renault' 'Chevrolet' 'Nissan'
 'Land' 'Jaguar' 'Fiat' 'Mitsubishi' 'Mini' 'Volvo' 'Porsche' 'Jeep'
 'Datsun' 'Isuzu' 'Force' 'Smart' 'Ambassador' 'Lamborghini' 'Bentley']


Replacing `Land` by `Land Rover`

In [7]:
cars['Brand'] = cars['Brand'].replace('Land', 'Land Rover') # 'Land' is only the first word of the brand name

print(cars.Brand.value_counts().index.values)

['Maruti' 'Hyundai' 'Honda' 'Toyota' 'Mercedes-benz' 'Volkswagen' 'Ford'
 'Mahindra' 'Bmw' 'Audi' 'Tata' 'Skoda' 'Renault' 'Chevrolet' 'Nissan'
 'Land Rover' 'Jaguar' 'Fiat' 'Mitsubishi' 'Mini' 'Volvo' 'Porsche' 'Jeep'
 'Datsun' 'Isuzu' 'Force' 'Smart' 'Ambassador' 'Lamborghini' 'Bentley']
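
The brand extraction and the `Land` fix can also be done in one vectorized pass. This is only a sketch of that idea: `extract_brand`, `MULTI_WORD_BRANDS` and `demo_names` are hypothetical names, and the lookup contains just the one multi-word brand seen in this dataset.

```python
import pandas as pd

# Hypothetical lookup for brands whose first token is not the full brand name.
MULTI_WORD_BRANDS = {"Land": "Land Rover"}

def extract_brand(names: pd.Series) -> pd.Series:
    """Take the first whitespace-separated token of each name as the brand."""
    first_token = names.str.split().str[0]
    # str.capitalize() mirrors the notebook, so 'BMW' becomes 'Bmw'.
    return first_token.str.capitalize().replace(MULTI_WORD_BRANDS)

demo_names = pd.Series([
    "BMW 3 Series 320d",
    "Land Rover Range Rover 3.0 D",
    "Nissan Micra XL",
])
print(extract_brand(demo_names).tolist())
```

Keeping the corrections in a dictionary makes it easy to add further entries if other truncated brand names turn up.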


#### Location

In [8]:
print(cars.Location.value_counts().index.values) # List of cities

['Mumbai' 'Hyderabad' 'Kochi' 'Coimbatore' 'Pune' 'Delhi' 'Kolkata'
 'Chennai' 'Jaipur' 'Bangalore' 'Ahmedabad']


All good here!

#### Year

In [9]:
print(cars.Year.value_counts().index.values) # List of years

[2014 2015 2016 2013 2017 2012 2011 2010 2018 2009 2008 2007 2019 2006
 2005 2004 2003 2002 2001 1998 2000 1999]


The oldest car is from **1998** and the newest is from **2019**, so the data spans 21 years.

#### Kilometers_Driven

I will first rename the column from **Kilometers_Driven** to **Km** for convenience.

In [10]:
cars.rename(columns = {         
    'Kilometers_Driven' : 'Km',
}, inplace = True)

In [11]:
cars.Km.describe()

count    6.019000e+03
mean     5.873838e+04
std      9.126884e+04
min      1.710000e+02
25%      3.400000e+04
50%      5.300000e+04
75%      7.300000e+04
max      6.500000e+06
Name: Km, dtype: float64

#### Fuel_Type

I will first rename the column from **Fuel_Type** to **Fuel** for convenience.

In [12]:
cars.rename(columns = {
    'Fuel_Type' : 'Fuel',
}, inplace = True)

In [13]:
cars.Fuel.value_counts()

Diesel      3205
Petrol      2746
CNG           56
LPG           10
Electric       2
Name: Fuel, dtype: int64

#### Transmission

In [14]:
cars.Transmission.value_counts()  # Viewing no. of auto cars and no. of manual cars.

Manual       4299
Automatic    1720
Name: Transmission, dtype: int64

Nothing to change here!

#### Owner_Type

In [15]:
cars.rename(columns = {
    'Owner_Type' : 'Owner'
}, inplace = True)

In [16]:
cars.Owner.value_counts()

First             4929
Second             968
Third              113
Fourth & Above       9
Name: Owner, dtype: int64

In [17]:
cars.sample(3)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
4940,Chevrolet Cruze LTZ,Hyderabad,2012,62005,Diesel,Manual,First,18.3 kmpl,1991 CC,147.9 bhp,5.0,,9.8,Chevrolet
5707,Hyundai i20 1.4 Sportz,Kochi,2018,40175,Diesel,Manual,Second,22.54 kmpl,1396 CC,88.73 bhp,5.0,,6.36,Hyundai
2554,Skoda Octavia Ambiente 1.9 TDI MT,Hyderabad,2002,99000,Diesel,Manual,Second,18.7 kmpl,1896 CC,66 bhp,5.0,,1.9,Skoda


#### Mileage

In [18]:
cars[cars.Fuel == 'LPG'][:2] # Viewing cars with fuel type LPG

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
5,Hyundai EON LPG Era Plus Option,Hyderabad,2012,75000,LPG,Manual,First,21.1 km/kg,814 CC,55.2 bhp,5.0,,2.35,Hyundai
936,Maruti Wagon R LXI LPG BSIV,Hyderabad,2012,72000,LPG,Manual,First,26.2 km/kg,998 CC,58.2 bhp,5.0,,2.85,Maruti


In [19]:
cars[cars.Fuel == 'CNG'][:2] # Viewing cars with fuel type CNG

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75,Maruti
127,Maruti Wagon R LXI CNG,Pune,2013,89900,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,3.25,Maruti


In [20]:
cars[cars.Fuel == 'Electric'] # Viewing cars with fuel type Electric

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
4446,Mahindra E Verito D4,Chennai,2016,50000,Electric,Automatic,First,,72 CC,41 bhp,5.0,13.58 Lakh,13.0,Mahindra
4904,Toyota Prius 2009-2016 Z4,Mumbai,2011,44000,Electric,Automatic,First,,1798 CC,73 bhp,5.0,,12.75,Toyota


In [21]:
cars[cars.Fuel == 'Diesel'][:2] # Viewing cars with fuel type Diesel

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5,Hyundai
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.0,Maruti


In [22]:
cars[cars.Fuel == 'Petrol'][:2] # Viewing cars with fuel type Petrol

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Mileage,Engine,Power,Seats,New_Price,Price,Brand
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.5,Honda
10,Maruti Ciaz Zeta,Kochi,2018,25692,Petrol,Manual,First,21.56 kmpl,1462 CC,103.25 bhp,5.0,10.65 Lakh,9.95,Maruti


##### Correcting the units of Mileage and converting dtype to floats.

In [23]:
def convert_unit(car):
    # Remove the unit suffix from Mileage and return the number as a float.
    if car.Fuel in ('Diesel', 'Petrol'):
        mileage = car.Mileage.replace('kmpl', '').strip()
    elif car.Fuel in ('CNG', 'LPG'):
        mileage = car.Mileage.replace('km/kg', '').strip()
    elif car.Fuel == 'Electric':
        # Mileage is missing for both electric cars, so hard-coded values are used.
        if car.Brand == 'Mahindra':
            mileage = 110
        else:
            mileage = 24
    return float(mileage)

In [24]:
cars['Mileage_Converted'] = cars.apply(convert_unit, axis = 1)

cars.drop(columns = 'Mileage', inplace = True)
cars.rename(columns = {'Mileage_Converted' : 'Mileage'}, inplace = True)

cars.sample(5)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Engine,Power,Seats,New_Price,Price,Brand,Mileage
6018,Chevrolet Beat Diesel,Hyderabad,2011,47000,Diesel,Manual,First,936 CC,57.6 bhp,5.0,,2.5,Chevrolet,25.44
5147,Mahindra Scorpio SLE BSIV,Pune,2012,99000,Diesel,Manual,Third,2179 CC,120 bhp,8.0,,6.0,Mahindra,12.05
3592,Honda Amaze E i-Vtech,Kochi,2015,51361,Petrol,Manual,First,1198 CC,86.7 bhp,5.0,,4.94,Honda,18.0
4210,Audi Q5 2.0 TDI Premium Plus,Bangalore,2014,53900,Diesel,Automatic,First,1968 CC,174.3 bhp,5.0,,29.0,Audi,14.16
4734,Volkswagen Vento 1.5 TDI Highline,Chennai,2011,121311,Diesel,Manual,First,1498 CC,108.5 bhp,5.0,14.32 Lakh,2.9,Volkswagen,20.64
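
As an alternative to the row-wise `apply`, the numeric part of the mileage strings can be pulled out with a regular expression. The sketch below is an assumption-laden illustration, not the notebook's method: `parse_mileage` and `demo_mileage` are hypothetical names, it leaves missing values as NaN instead of filling in values for electric cars, and, like the notebook, it treats `kmpl` and `km/kg` figures as directly comparable.

```python
import numpy as np
import pandas as pd

def parse_mileage(mileage: pd.Series) -> pd.Series:
    """Extract the leading number from strings like '18.2 kmpl' or '26.6 km/kg'."""
    number = mileage.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    # Missing or unparseable entries end up as NaN.
    return pd.to_numeric(number, errors="coerce")

demo_mileage = pd.Series(["18.2 kmpl", "26.6 km/kg", np.nan, "0.0 kmpl"])
parsed = parse_mileage(demo_mileage)
print(parsed.tolist())
```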


In [25]:
cars.Mileage.describe()

count    6019.000000
mean       18.151198
std         4.732671
min         0.000000
25%        15.170000
50%        18.160000
75%        21.100000
max       110.000000
Name: Mileage, dtype: float64

In [26]:
print("No. of cars with mileage 0 (zero) :", len(cars[cars.Mileage == 0]))

No. of cars with mileage 0 (zero) : 68


A mileage of 0 is not physically meaningful, so these most likely represent missing values. I will replace them with NaN.

In [27]:
cars['Mileage'] = cars['Mileage'].replace(0, np.nan) # assign back so the change is applied to the dataframe
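
To see what this step does, here is a small sketch on a toy column (the `demo` frame and its values are made up). Once zeros become NaN, summary statistics such as `min()` skip them.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Mileage": [18.2, 0.0, 21.1, 0.0]})

# Assign the result back to the column so the frame itself is updated.
demo["Mileage"] = demo["Mileage"].replace(0, np.nan)

missing_count = int(demo["Mileage"].isnull().sum())
smallest = demo["Mileage"].min()  # NaN values are skipped by default
print(missing_count, smallest)
```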

#### Engine

In [28]:
cars[cars.Engine.isnull()]

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Engine,Power,Seats,New_Price,Price,Brand,Mileage
194,Honda City 1.5 GXI,Ahmedabad,2007,60006,Petrol,Manual,First,,,,,2.95,Honda,
208,Maruti Swift 1.3 VXi,Kolkata,2010,42001,Petrol,Manual,First,,,,,2.11,Maruti,16.1
733,Maruti Swift 1.3 VXi,Chennai,2006,97800,Petrol,Manual,Third,,,,,1.75,Maruti,16.1
749,Land Rover Range Rover 3.0 D,Mumbai,2008,55001,Diesel,Automatic,Second,,,,,26.5,Land Rover,
1294,Honda City 1.3 DX,Delhi,2009,55005,Petrol,Manual,First,,,,,3.2,Honda,12.8
1327,Maruti Swift 1.3 ZXI,Hyderabad,2015,50295,Petrol,Manual,First,,,,,5.8,Maruti,16.1
1385,Honda City 1.5 GXI,Pune,2004,115000,Petrol,Manual,Second,,,,,1.5,Honda,
1460,Land Rover Range Rover Sport 2005 2012 Sport,Coimbatore,2008,69078,Petrol,Manual,First,,,,,40.88,Land Rover,
2074,Maruti Swift 1.3 LXI,Pune,2011,24255,Petrol,Manual,First,,,,,3.15,Maruti,16.1
2096,Hyundai Santro LP zipPlus,Coimbatore,2004,52146,Petrol,Manual,First,,,,,1.93,Hyundai,


##### Cars with NaN in the Engine column are also missing Power, Seats and New_Price, and some are missing Mileage as well, so I am removing these cars from the dataset.

In [29]:
cars_to_drop_index = cars[cars.Engine.isnull()].index.values # indexes of cars with missing engine vals.

In [30]:
cars.drop(cars_to_drop_index, inplace = True) 
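
The same rows can be removed in a single call with `dropna` restricted to one column. A minimal sketch on a toy frame follows; `demo` and `cleaned` are hypothetical names and the rows are illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Name": ["Honda City 1.5 GXI", "Honda Jazz V", "Maruti Swift 1.3 VXi"],
    "Engine": [np.nan, "1199 CC", np.nan],
    "Price": [2.95, 4.5, 2.11],
})

# Drop rows whose Engine is missing; gaps in other columns are ignored.
cleaned = demo.dropna(subset=["Engine"])
print(len(demo) - len(cleaned), "rows dropped")
```

Unlike `drop(..., inplace=True)`, this returns a new frame, so the original stays available for comparison.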

In [31]:
def remove_CC(car):
    engine = car.Engine
    if (engine[-2:] == 'CC') :
        engine = engine.strip('CC')
    engine = int(engine)
    return engine

In [32]:
cars['Engine_'] = cars.apply(remove_CC, axis = 1)

cars.drop(columns = 'Engine', inplace = True)
cars.rename(columns = {'Engine_' : 'Engine'}, inplace = True)

cars.sample(3)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Power,Seats,New_Price,Price,Brand,Mileage,Engine
2983,Honda City ZX GXi,Pune,2008,119000,Petrol,Manual,Second,78 bhp,5.0,,2.25,Honda,17.7,1497
3078,Skoda Rapid 1.6 MPI Ambition,Mumbai,2015,39822,Petrol,Manual,First,103.52 bhp,5.0,12.02 Lakh,4.68,Skoda,15.41,1598
2101,Mahindra Quanto C8,Kolkata,2013,36000,Diesel,Manual,First,100 bhp,7.0,,3.9,Mahindra,17.21,1493


In [33]:
cars.Engine.describe()

count    5983.000000
mean     1621.276450
std       601.355233
min        72.000000
25%      1198.000000
50%      1493.000000
75%      1984.000000
max      5998.000000
Name: Engine, dtype: float64

#### Power

In [34]:
cars.Power.describe()

count       5983
unique       372
top       74 bhp
freq         235
Name: Power, dtype: object

In [35]:
print('No. of null or missing values :', cars.Power.isnull().sum())

No. of null or missing values : 0


In [36]:
len(cars[(cars.Power == 'null bhp')])

107

##### Some cars have `null bhp` as their power. I will replace that with `np.nan` and strip `bhp` from the rest.

In [37]:
def remove_bhp(car):
    bhp = car.Power
    if (bhp == 'null bhp') :
        bhp = np.nan
    elif (bhp[-3:] == 'bhp'):
        bhp = bhp.strip(' bhp')
        bhp = float(bhp)
    return bhp

In [38]:
cars['Power_'] = cars.apply(remove_bhp, axis = 1)

cars.drop(columns = 'Power', inplace = True)
cars.rename(columns = {'Power_' : 'Power'}, inplace = True)

cars.sample(3)

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Seats,New_Price,Price,Brand,Mileage,Engine,Power
720,Chevrolet Beat Diesel LT,Bangalore,2012,40700,Diesel,Manual,First,5.0,,2.65,Chevrolet,25.44,936,57.6
2127,Maruti Baleno Delta,Mumbai,2016,12573,Petrol,Manual,First,5.0,7.36 Lakh,5.8,Maruti,21.4,1197,83.1
2260,BMW 5 Series 520d Luxury Line,Ahmedabad,2012,95000,Diesel,Automatic,Second,5.0,63.71 Lakh,17.5,Bmw,22.48,1995,190.0


In [39]:
print('No. of null or missing values :', cars.Power.isnull().sum())

No. of null or missing values : 107
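
A vectorized variant of this cleanup can lean on `pd.to_numeric` with `errors="coerce"`, which turns anything that is not a number into NaN. This is a sketch under that assumption; `parse_power` and `demo_power` are hypothetical names.

```python
import pandas as pd

def parse_power(power: pd.Series) -> pd.Series:
    """Strip the 'bhp' suffix and convert to float; unparseable values become NaN."""
    stripped = power.str.replace("bhp", "", regex=False).str.strip()
    # errors="coerce" maps leftovers such as 'null' to NaN.
    return pd.to_numeric(stripped, errors="coerce")

demo_power = pd.Series(["74 bhp", "null bhp", "103.52 bhp"])
parsed_power = parse_power(demo_power)
print(parsed_power.tolist())
```

With this approach the `null bhp` placeholder needs no special case, although any other malformed string would also silently become NaN.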


#### Seats

In [40]:
cars.Seats.describe()

count    5977.000000
mean        5.278735
std         0.808840
min         0.000000
25%         5.000000
50%         5.000000
75%         5.000000
max        10.000000
Name: Seats, dtype: float64

The minimum number of seats is 0 (zero)!

In [41]:
cars[cars.Seats == 0]

Unnamed: 0,Name,Location,Year,Km,Fuel,Transmission,Owner,Seats,New_Price,Price,Brand,Mileage,Engine,Power
3999,Audi A4 3.2 FSI Tiptronic Quattro,Hyderabad,2012,125000,Petrol,Automatic,First,0.0,,18.0,Audi,10.5,3197,


Since only one car has 0 seats, I will correct its value using data from the internet.

In [42]:
cars.loc[cars.Seats == 0, ['Seats']] = 5

In [43]:
cars.Seats.value_counts()

5.0     5015
7.0      674
8.0      134
4.0       99
6.0       31
2.0       16
10.0       5
9.0        3
Name: Seats, dtype: int64

#### New_Price

As mentioned earlier, the `New_Price` column has missing values in about `86%` of rows, so I will remove it from the dataset.

In [44]:
cars.drop(columns = ['New_Price'], inplace = True)

In [45]:
cars.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,5983.0,2013.383085,3.249102,1998.0,2011.0,2014.0,2016.0,2019.0
Km,5983.0,58684.183186,91503.344783,171.0,33965.5,53000.0,73000.0,6500000.0
Seats,5977.0,5.279572,0.80596,2.0,5.0,5.0,5.0,10.0
Price,5983.0,9.496263,11.200462,0.44,3.5,5.65,9.95,160.0
Mileage,5926.0,18.366564,4.346963,6.4,15.3,18.2,21.1,110.0
Engine,5983.0,1621.27645,601.355233,72.0,1198.0,1493.0,1984.0,5998.0
Power,5876.0,113.25305,53.874957,34.2,75.0,97.7,138.1,560.0


##### End of preparing data for analysis. 