## Problem Statement: Predict the prices of used cars from data collected from various sources across several locations in India.

### FEATURES:

Name             : The brand and model of the car.

Location         : The location in which the car is being sold or is available for purchase.

Year             : The year or edition of the model.

Kilometers_Driven: The total distance driven by the previous owner(s), in km.

Fuel_Type        : The type of fuel used by the car.

Transmission     : The type of transmission used by the car.

Owner_Type       : Whether the ownership is first-hand, second-hand, or other.

Mileage          : The standard mileage quoted by the car company, in kmpl or km/kg.

Engine           : The displacement volume of the engine in cc.

Power            : The maximum power of the engine in bhp.

Seats            : The number of seats in the car.

Price            : The price of the used car in INR Lakhs.

## Tasks:

1. Clean the data (null-value removal, outlier identification).

2. Null values: drop the affected rows/columns or impute them, and state the reason for the choice.

3. EDA (repeat the relationship analysis from the minor project here).

4. Handle categorical variables (label encoding / one-hot encoding).

5. Scale the Kilometers_Driven column.

6. Do the train-test split.

7. Apply different ML regression algorithms.

8. Calculate the error metrics.
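
Tasks 4 to 8 can be sketched end to end as follows. This is a minimal illustration rather than the notebook's actual pipeline: it assumes scikit-learn is installed, the `toy` frame is hypothetical (the first five rows mirror `df.head()`, the last three are made up), and linear regression is only one possible choice of model. With so few rows the resulting metrics are not meaningful; the point is the order of the steps.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the cleaned used-car table.
toy = pd.DataFrame({
    "Year": [2010, 2015, 2011, 2012, 2013, 2016, 2014, 2017],
    "Kilometers_Driven": [72000, 41000, 46000, 87000, 40670, 30000, 55000, 20000],
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel", "Diesel", "Petrol", "Petrol", "Diesel"],
    "Transmission": ["Manual", "Manual", "Manual", "Manual", "Automatic",
                     "Manual", "Automatic", "Automatic"],
    "Price": [1.75, 12.5, 4.5, 6.0, 17.74, 5.2, 7.9, 21.0],
})

# Task 4: one-hot encode the categorical columns (first level dropped).
X = pd.get_dummies(
    toy.drop(columns="Price"),
    columns=["Fuel_Type", "Transmission"],
    drop_first=True,
).astype(float)
y = toy["Price"]

# Task 6: split before scaling so the scaler only sees training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

# Task 5: standardise Kilometers_Driven using training statistics only.
scaler = StandardScaler()
X_train["Kilometers_Driven"] = scaler.fit_transform(X_train[["Kilometers_Driven"]]).ravel()
X_test["Kilometers_Driven"] = scaler.transform(X_test[["Kilometers_Driven"]]).ravel()

# Task 7: fit one regression model (others can be swapped in here).
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Task 8: error metrics on the held-out rows.
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Fitting the scaler after the split keeps information from the test rows out of the training step.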

In [69]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


In [70]:
df = pd.read_excel(r'Data_Train.xlsx')

In [71]:
df.describe()

Unnamed: 0,Year,Kilometers_Driven,Seats,Price
count,6019.0,6019.0,5977.0,6019.0
mean,2013.358199,58738.38,5.278735,9.479468
std,3.269742,91268.84,0.80884,11.187917
min,1998.0,171.0,0.0,0.44
25%,2011.0,34000.0,5.0,3.5
50%,2014.0,53000.0,5.0,5.64
75%,2016.0,73000.0,5.0,9.95
max,2019.0,6500000.0,10.0,160.0


In [72]:
df.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,Price
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,1.75
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,12.5
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,4.5
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,6.0
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,17.74


In [73]:
df.shape

(6019, 12)

In [74]:
df.dtypes

Name                  object
Location              object
Year                   int64
Kilometers_Driven      int64
Fuel_Type             object
Transmission          object
Owner_Type            object
Mileage               object
Engine                object
Power                 object
Seats                float64
Price                float64
dtype: object

In [76]:
uniqueName, occurrenceNumber = np.unique(df['Name'], return_counts=True)
print(uniqueName, "\n", occurrenceNumber)

['Ambassador Classic Nova Diesel' 'Audi A3 35 TDI Attraction'
 'Audi A3 35 TDI Premium' ... 'Volvo XC60 D5 Inscription'
 'Volvo XC90 2007-2015 D5 AT AWD' 'Volvo XC90 2007-2015 D5 AWD'] 
 [1 1 1 ... 1 1 1]
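
Because most full names occur only once, `Name` is too fine-grained to use directly as a category. One option, shown here as a sketch, is to keep only the first word as a brand. The `names` series is a hypothetical sample taken from `df.head()`, and the first-word rule is an assumption that would mislabel multi-word brands such as "Land Rover".

```python
import pandas as pd

# Hypothetical sample of the Name column, copied from df.head().
names = pd.Series(
    [
        "Maruti Wagon R LXI CNG",
        "Hyundai Creta 1.6 CRDi SX Option",
        "Honda Jazz V",
        "Maruti Ertiga VDI",
        "Audi A4 New 2.0 TDI Multitronic",
    ],
    name="Name",
)

# Assumption: the brand is the first whitespace-separated token.
brand = names.str.split().str[0]

# Count how many listings each brand has, most frequent first.
brand_counts = brand.value_counts()
print(brand_counts)
```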


In [77]:
df['Seats'].isna().sum()

42
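
Rather than checking one column at a time, the missing values of every column can be counted in a single pass. The sketch below uses a small hypothetical `sample` frame; on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
sample = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "Engine": ["998 CC", None, "1248 CC", "1199 CC"],
    "Mileage": ["26.6 km/kg", "19.67 kmpl", None, "18.2 kmpl"],
    "Price": [1.75, 12.5, 4.5, 6.0],
})

# Missing count per column, plus the share of rows affected.
missing = sample.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (100 * missing / len(sample)).round(2),
})

# Show only the columns that actually have missing values.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

The percentage column helps justify the choice between dropping rows and imputing, which task 2 asks for.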

In [83]:
df.dropna(subset=["Seats"], inplace=True)

In [86]:
df.dropna(subset=["Engine"], inplace=True)

In [87]:
df['Name'].isna().sum()

0

In [88]:
print(df['Engine'].isna().sum())

0


In [89]:
uniqueEngine, occurrenceNumber = np.unique(df['Engine'].astype(str), return_counts=True)
print(uniqueEngine, occurrenceNumber)

['1047 CC' '1061 CC' '1086 CC' '1120 CC' '1150 CC' '1172 CC' '1186 CC'
 '1193 CC' '1194 CC' '1196 CC' '1197 CC' '1198 CC' '1199 CC' '1242 CC'
 '1248 CC' '1298 CC' '1299 CC' '1341 CC' '1343 CC' '1364 CC' '1368 CC'
 '1373 CC' '1388 CC' '1390 CC' '1395 CC' '1396 CC' '1399 CC' '1405 CC'
 '1422 CC' '1461 CC' '1462 CC' '1468 CC' '1489 CC' '1493 CC' '1495 CC'
 '1496 CC' '1497 CC' '1498 CC' '1499 CC' '1582 CC' '1586 CC' '1590 CC'
 '1591 CC' '1595 CC' '1596 CC' '1597 CC' '1598 CC' '1599 CC' '1781 CC'
 '1794 CC' '1796 CC' '1797 CC' '1798 CC' '1799 CC' '1896 CC' '1948 CC'
 '1950 CC' '1956 CC' '1968 CC' '1969 CC' '1978 CC' '1984 CC' '1985 CC'
 '1991 CC' '1995 CC' '1997 CC' '1998 CC' '1999 CC' '2092 CC' '2112 CC'
 '2143 CC' '2147 CC' '2148 CC' '2149 CC' '2179 CC' '2198 CC' '2199 CC'
 '2200 CC' '2349 CC' '2354 CC' '2359 CC' '2360 CC' '2362 CC' '2393 CC'
 '2400 CC' '2446 CC' '2477 CC' '2487 CC' '2489 CC' '2494 CC' '2495 CC'
 '2496 CC' '2497 CC' '2498 CC' '2499 CC' '2523 CC' '2609 CC' '2694 CC'
 '2696

In [56]:
df.shape

(5983, 12)

In [90]:
print(df['Seats'].isna().sum())

0


In [92]:
print(df['Mileage'].isna().sum())

2


In [93]:
df.dropna(subset=["Mileage"], inplace=True)

In [94]:
print(df['Mileage'].isna().sum())

0


In [95]:
print(df['Price'].isna().sum())

0


In [96]:
df.shape

(5975, 12)

In [97]:
uniqueSeats, occurrenceNumber = np.unique(df['Seats'].astype(str), return_counts=True)
print(uniqueSeats, occurrenceNumber)

['0.0' '10.0' '2.0' '4.0' '5.0' '6.0' '7.0' '8.0' '9.0'] [   1    5   16   99 5012   31  674  134    3]
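
The output above is sorted as text, which is why `'10.0'` appears before `'2.0'`. Counting on the numeric column keeps the natural order and makes the implausible `0.0` entry easy to spot. This is a sketch on a hypothetical `seats` series, not the full column.

```python
import pandas as pd

# Hypothetical sample of the Seats column, including one zero-seat row.
seats = pd.Series([5.0, 7.0, 5.0, 10.0, 2.0, 0.0, 5.0], name="Seats")

# Count each seat value and sort by the numeric value, not by its string form.
seat_counts = seats.value_counts().sort_index()
print(seat_counts)

# A car cannot have zero seats, so flag such rows for review.
invalid = seats[seats <= 0]
print(invalid)
```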


In [66]:
# NaN values in Seats, Engine, and Mileage have been removed
# Next: outlier detection
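
One common starting point for outlier detection is the 1.5 × IQR rule. The sketch below applies it to a hypothetical `km` series whose values are taken from the `describe()` and `head()` outputs above, including the maximum of 6,500,000 km. The rule and the 1.5 multiplier are assumptions, not necessarily what the rest of the notebook uses.

```python
import pandas as pd

# Hypothetical sample of Kilometers_Driven, including the reported maximum.
km = pd.Series(
    [34000, 53000, 73000, 41000, 46000, 87000, 40670, 6500000],
    name="Kilometers_Driven",
)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = km.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged, not removed automatically.
outliers = km[(km < lower) | (km > upper)]
print(f"fences: [{lower:.1f}, {upper:.1f}]")
print(outliers)
```

Flagged rows still deserve a manual look before dropping, since a high reading can be a genuine value rather than a data-entry error.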