
## Step-1: Importing Libraries


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Step-2: Reading the Dataset

In [8]:
# Load the used-car listings into a DataFrame
data = pd.read_csv("used_cars_data.csv")

## Analyzing the data
- shape
- head()
- tail()
- info()

In [9]:
data.head()

Unnamed: 0,S.No.,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
0,0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75
1,1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5
2,2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.5
3,3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.0
4,4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,,17.74


In [10]:
data.shape

(7253, 14)

In [11]:
data.tail()

Unnamed: 0,S.No.,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
7248,7248,Volkswagen Vento Diesel Trendline,Hyderabad,2011,89411,Diesel,Manual,First,20.54 kmpl,1598 CC,103.6 bhp,5.0,,
7249,7249,Volkswagen Polo GT TSI,Mumbai,2015,59000,Petrol,Automatic,First,17.21 kmpl,1197 CC,103.6 bhp,5.0,,
7250,7250,Nissan Micra Diesel XV,Kolkata,2012,28000,Diesel,Manual,First,23.08 kmpl,1461 CC,63.1 bhp,5.0,,
7251,7251,Volkswagen Polo GT TSI,Pune,2013,52262,Petrol,Automatic,Third,17.2 kmpl,1197 CC,103.6 bhp,5.0,,
7252,7252,Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan...,Kochi,2014,72443,Diesel,Automatic,First,10.0 kmpl,2148 CC,170 bhp,5.0,,


In [13]:
# info() shows the data type of each column, the number of non-null records per column,
# and the memory usage of the data set.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7253 entries, 0 to 7252
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   S.No.              7253 non-null   int64  
 1   Name               7253 non-null   object 
 2   Location           7253 non-null   object 
 3   Year               7253 non-null   int64  
 4   Kilometers_Driven  7253 non-null   int64  
 5   Fuel_Type          7253 non-null   object 
 6   Transmission       7253 non-null   object 
 7   Owner_Type         7253 non-null   object 
 8   Mileage            7251 non-null   object 
 9   Engine             7207 non-null   object 
 10  Power              7207 non-null   object 
 11  Seats              7200 non-null   float64
 12  New_Price          1006 non-null   object 
 13  Price              6019 non-null   float64
dtypes: float64(2), int64(3), object(9)
memory usage: 793.4+ KB


## Check for Duplication
> `nunique()` returns the number of unique values in each column. Together with the data description, these counts help identify which columns are continuous and which are categorical. Duplicated data can then be handled or removed based on further analysis.

In [8]:
data.nunique()

S.No.                7253
Name                 2041
Location               11
Year                   23
Kilometers_Driven    3660
Fuel_Type               5
Transmission            2
Owner_Type              4
Mileage               450
Engine                150
Power                 386
Seats                   9
New_Price             625
Price                1373
dtype: int64
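
Note that `nunique()` counts distinct values per column; it does not by itself reveal repeated rows. A minimal sketch of a direct duplicate check with `duplicated()` and `drop_duplicates()`, using a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the car data; values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Honda Jazz V", "Honda Jazz V", "Maruti Ertiga VDI"],
    "Year": [2011, 2011, 2012],
    "Price": [4.5, 4.5, 6.0],
})

# duplicated() marks every row that repeats an earlier row (the first occurrence is not marked).
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates() returns a new frame without the repeated rows.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

On the real data, an ID column such as `S.No.` makes every row unique, so this check is more informative once that column has been set aside.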

## Missing Values
> `isnull()` is widely used in pre-processing to identify null values in the data.
> Here, `data.isnull().sum()` returns the number of missing records in each column.

In [18]:
data.isnull().sum()

S.No.                   0
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 46
Power                  46
Seats                  53
New_Price            6247
Price                1234
dtype: int64

## Calculate the percentage of missing values in each column


In [19]:
(data.isnull().sum()/(len(data)))*100

S.No.                 0.000000
Name                  0.000000
Location              0.000000
Year                  0.000000
Kilometers_Driven     0.000000
Fuel_Type             0.000000
Transmission          0.000000
Owner_Type            0.000000
Mileage               0.027575
Engine                0.634220
Power                 0.634220
Seats                 0.730732
New_Price            86.129877
Price                17.013650
dtype: float64
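
The counts and percentages can also be viewed side by side. A small sketch on a toy frame (`toy` and `missing_summary` are illustrative names); `isnull().mean()` is the fraction of missing rows, so multiplying it by 100 gives the same percentage as the expression above:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the pattern of missing entries matters here.
toy = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "New_Price": [np.nan, np.nan, "8.61 Lakh", np.nan],
    "Year": [2010, 2015, 2011, 2012],
})

# One row per column: absolute count and percentage of missing values,
# sorted so the most incomplete columns come first.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```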

## Step-3: Data Reduction

> Some columns or variables can be dropped if they do not add value to our analysis.
> In our dataset, the column S.No. holds only ID values, so we assume it has no predictive power for the dependent variable.

In [20]:
# remove S.No. column from data

data = data.drop(columns=["S.No."])

In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7253 entries, 0 to 7252
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               7253 non-null   object 
 1   Location           7253 non-null   object 
 2   Year               7253 non-null   int64  
 3   Kilometers_Driven  7253 non-null   int64  
 4   Fuel_Type          7253 non-null   object 
 5   Transmission       7253 non-null   object 
 6   Owner_Type         7253 non-null   object 
 7   Mileage            7251 non-null   object 
 8   Engine             7207 non-null   object 
 9   Power              7207 non-null   object 
 10  Seats              7200 non-null   float64
 11  New_Price          1006 non-null   object 
 12  Price              6019 non-null   float64
dtypes: float64(2), int64(2), object(9)
memory usage: 736.8+ KB


## Step-4: Feature Engineering

> Feature engineering refers to the process of using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modeling. The main goal of feature engineering is to create meaningful data from raw data.

## Step-5: Creating New Feature

> We will work with the variables Year and Name in our dataset. In the sample data, the column 'Year' shows the manufacturing year of the car.

> The age of a car is a contributing factor to its price, but age is hard to read directly from a manufacturing year, so we derive it as a separate column.

### Introducing a new column, 'Car_Age' to know the age of the car.

In [26]:
from datetime import date 
date.today().year

2023

In [27]:
data["Car_Age"]= date.today().year - data['Year']
data.head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price,Car_Age
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75,13
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5,8
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.5,12
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.0,11
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,,17.74,10
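
Because `date.today().year` changes over time, rerunning the notebook in a later year yields different `Car_Age` values than the ones shown above. One option, sketched here with a hypothetical constant `REFERENCE_YEAR` and a toy frame, is to pin the reference year:

```python
import pandas as pd

# Hypothetical constant: the year the ages are measured against.
# 2023 matches the output shown above; adjust it to suit your analysis.
REFERENCE_YEAR = 2023

# Toy frame with three manufacturing years from the sample rows.
toy = pd.DataFrame({"Year": [2010, 2015, 2011]})
toy["Car_Age"] = REFERENCE_YEAR - toy["Year"]
print(toy)
```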


Car names by themselves will not be great predictors of price in our current data, but we can process this column to extract useful information such as the brand and model names.
**Let's split the name and introduce new variables "Brand" and "Model"**

In [34]:
data.Name.str.split().str.get(0)

0              Maruti
1             Hyundai
2               Honda
3              Maruti
4                Audi
            ...      
7248       Volkswagen
7249       Volkswagen
7250           Nissan
7251       Volkswagen
7252    Mercedes-Benz
Name: Name, Length: 7253, dtype: object

In [36]:
data['Name'].str.split().str.get(0)

0              Maruti
1             Hyundai
2               Honda
3              Maruti
4                Audi
            ...      
7248       Volkswagen
7249       Volkswagen
7250           Nissan
7251       Volkswagen
7252    Mercedes-Benz
Name: Name, Length: 7253, dtype: object

In [54]:
data['Brand']= data['Name'].str.split().str.get(0)

In [55]:
# Concatenate the second and third tokens of the name (no separator, e.g. "WagonR")
data['Model'] = data['Name'].str.split().str.get(1) + data['Name'].str.split().str.get(2)

In [56]:
data[['Name','Brand','Model']]

Unnamed: 0,Name,Brand,Model
0,Maruti Wagon R LXI CNG,Maruti,WagonR
1,Hyundai Creta 1.6 CRDi SX Option,Hyundai,Creta1.6
2,Honda Jazz V,Honda,JazzV
3,Maruti Ertiga VDI,Maruti,ErtigaVDI
4,Audi A4 New 2.0 TDI Multitronic,Audi,A4New
...,...,...,...
7248,Volkswagen Vento Diesel Trendline,Volkswagen,VentoDiesel
7249,Volkswagen Polo GT TSI,Volkswagen,PoloGT
7250,Nissan Micra Diesel XV,Nissan,MicraDiesel
7251,Volkswagen Polo GT TSI,Volkswagen,PoloGT
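
Two details of the approach above are worth noting: joining the second and third tokens with `+` leaves no space between them (`WagonR`), and a name with fewer than three tokens yields a missing `Model`. The later outputs also show a `Type` column whose creating cell is not included here; its values look like the fourth token of `Name`, but that is an inference. A sketch of a variant that splits once, under those assumptions and on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"Name": [
    "Maruti Wagon R LXI CNG",
    "Honda Jazz V",
    "Audi A4 New 2.0 TDI Multitronic",
]})

# Split each name into whitespace-separated tokens once and reuse the result.
tokens = toy["Name"].str.split()

toy["Brand"] = tokens.str.get(0)
# Take the second and third tokens, then join them with a space.
toy["Model"] = tokens.str[1:3].str.join(" ")
# Assumption: Type is the fourth token (missing when the name is shorter).
toy["Type"] = tokens.str.get(3)

print(toy[["Brand", "Model", "Type"]])
```

Slicing before joining also means a two-token name keeps its single model token instead of becoming missing.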


## Step-6: Data Cleaning

> Some variable names are not relevant or not easy to understand, some values may contain data entry errors, and some variables may need data type conversion. We need to fix these issues in the data.

> In this example, the brand names 'Isuzu'/'ISUZU' (inconsistent case) and 'Mini' and 'Land' (truncated) look incorrect and need to be corrected.

In [57]:
print(data.Brand.unique())

['Maruti' 'Hyundai' 'Honda' 'Audi' 'Nissan' 'Toyota' 'Volkswagen' 'Tata'
 'Land' 'Mitsubishi' 'Renault' 'Mercedes-Benz' 'BMW' 'Mahindra' 'Ford'
 'Porsche' 'Datsun' 'Jaguar' 'Volvo' 'Chevrolet' 'Skoda' 'Mini' 'Fiat'
 'Jeep' 'Smart' 'Ambassador' 'Isuzu' 'ISUZU' 'Force' 'Bentley'
 'Lamborghini' 'Hindustan' 'OpelCorsa']


In [58]:
print(data.Brand.nunique())

33


In [59]:
searchfor=['Isuzu','ISUZU', 'Land','Mini']
data[data['Brand'].str.contains('|'.join(searchfor))].head()

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price,Car_Age,Brand,Model,Type
13,Land Rover Range Rover 2.2L Pure,Delhi,2014,72000,Diesel,Automatic,First,12.7 kmpl,2179 CC,187.7 bhp,5.0,,27.0,9,Land,RoverRange,Rover
14,Land Rover Freelander 2 TD4 SE,Pune,2012,85000,Diesel,Automatic,Second,0.0 kmpl,2179 CC,115 bhp,5.0,,17.5,11,Land,RoverFreelander,2
176,Mini Countryman Cooper D,Jaipur,2017,8525,Diesel,Automatic,Second,16.6 kmpl,1998 CC,112 bhp,5.0,,23.0,6,Mini,CountrymanCooper,D
191,Land Rover Range Rover 2.2L Dynamic,Coimbatore,2018,36091,Diesel,Automatic,First,12.7 kmpl,2179 CC,187.7 bhp,5.0,,55.76,5,Land,RoverRange,Rover
228,Mini Cooper Convertible S,Kochi,2017,26327,Petrol,Automatic,First,16.82 kmpl,1998 CC,189.08 bhp,4.0,44.28 Lakh,35.67,6,Mini,CooperConvertible,S


In [60]:
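# Assign the result back: replace(..., inplace=True) on a selected column
# may not update `data` itself in newer pandas versions.
data['Brand'] = data['Brand'].replace({'ISUZU': 'Isuzu', 'Mini': 'Mini Cooper', 'Land': 'Land Rover'})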

In [63]:
data

Unnamed: 0,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price,Car_Age,Brand,Model,Type
0,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75,13,Maruti,WagonR,LXI
1,Hyundai Creta 1.6 CRDi SX Option,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.50,8,Hyundai,Creta1.6,CRDi
2,Honda Jazz V,Chennai,2011,46000,Petrol,Manual,First,18.2 kmpl,1199 CC,88.7 bhp,5.0,8.61 Lakh,4.50,12,Honda,JazzV,
3,Maruti Ertiga VDI,Chennai,2012,87000,Diesel,Manual,First,20.77 kmpl,1248 CC,88.76 bhp,7.0,,6.00,11,Maruti,ErtigaVDI,
4,Audi A4 New 2.0 TDI Multitronic,Coimbatore,2013,40670,Diesel,Automatic,Second,15.2 kmpl,1968 CC,140.8 bhp,5.0,,17.74,10,Audi,A4New,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7248,Volkswagen Vento Diesel Trendline,Hyderabad,2011,89411,Diesel,Manual,First,20.54 kmpl,1598 CC,103.6 bhp,5.0,,,12,Volkswagen,VentoDiesel,Trendline
7249,Volkswagen Polo GT TSI,Mumbai,2015,59000,Petrol,Automatic,First,17.21 kmpl,1197 CC,103.6 bhp,5.0,,,8,Volkswagen,PoloGT,TSI
7250,Nissan Micra Diesel XV,Kolkata,2012,28000,Diesel,Manual,First,23.08 kmpl,1461 CC,63.1 bhp,5.0,,,11,Nissan,MicraDiesel,XV
7251,Volkswagen Polo GT TSI,Pune,2013,52262,Petrol,Automatic,Third,17.2 kmpl,1197 CC,103.6 bhp,5.0,,,10,Volkswagen,PoloGT,TSI


We have done the fundamental data analysis, feature engineering, and data clean-up. Our data is now ready for EDA.

## Step-7: EDA - Exploratory Data Analysis

> Exploratory Data Analysis refers to the crucial process of performing initial investigations on data to discover patterns and check assumptions with the help of summary statistics and graphical representations.

- EDA can be leveraged to check for outliers, patterns, and trends in the given data.
- EDA helps to find meaningful patterns in data.
- EDA provides in-depth insights into the data sets to solve our business problems.
- EDA gives a clue to impute missing values in the dataset.

## Step-8: Statistics Summary

A statistics summary gives a quick and simple description of the data.

It can include count, mean, standard deviation, median, mode, minimum value, maximum value, range, etc.

It gives a high-level idea of whether the data has outliers or data entry errors, and of how the data is distributed, for example normally distributed or left/right skewed.

In Python, this can be achieved using **describe()**. By default it provides a statistics summary of the numerical columns (int, float); passing `include='all'` also summarizes the non-numeric columns.

In [64]:
data.describe()

Unnamed: 0,Year,Kilometers_Driven,Seats,Price,Car_Age
count,7253.0,7253.0,7200.0,6019.0,7253.0
mean,2013.365366,58699.06,5.279722,9.479468,9.634634
std,3.254421,84427.72,0.81166,11.187917,3.254421
min,1996.0,171.0,0.0,0.44,4.0
25%,2011.0,34000.0,5.0,3.5,7.0
50%,2014.0,53416.0,5.0,5.64,9.0
75%,2016.0,73000.0,5.0,9.95,12.0
max,2019.0,6500000.0,10.0,160.0,27.0


In [65]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,7253.0,2013.365366,3.254421,1996.0,2011.0,2014.0,2016.0,2019.0
Kilometers_Driven,7253.0,58699.063146,84427.720583,171.0,34000.0,53416.0,73000.0,6500000.0
Seats,7200.0,5.279722,0.81166,0.0,5.0,5.0,5.0,10.0
Price,6019.0,9.479468,11.187917,0.44,3.5,5.64,9.95,160.0
Car_Age,7253.0,9.634634,3.254421,4.0,7.0,9.0,12.0,27.0
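
The describe() output already hints at skew: for `Price`, the mean (9.48) is well above the median (5.64), which suggests a long right tail. `Series.skew()` puts a number on this. A small sketch on made-up prices (the values are illustrative, not the dataset):

```python
import pandas as pd

# Made-up prices with one very large value to mimic a right-skewed column.
prices = pd.Series([1.75, 3.5, 4.5, 5.64, 6.0, 9.95, 12.5, 17.74, 160.0], name="Price")

# Positive skewness indicates a long right tail, negative a long left tail.
skewness = prices.skew()
print(round(skewness, 2))

# A mean well above the median points the same way.
print(prices.mean() > prices.median())
```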


In [67]:
data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Name,7253.0,2041.0,Mahindra XUV500 W8 2WD,55.0,,,,,,,
Location,7253.0,11.0,Mumbai,949.0,,,,,,,
Year,7253.0,,,,2013.365366,3.254421,1996.0,2011.0,2014.0,2016.0,2019.0
Kilometers_Driven,7253.0,,,,58699.063146,84427.720583,171.0,34000.0,53416.0,73000.0,6500000.0
Fuel_Type,7253.0,5.0,Diesel,3852.0,,,,,,,
Transmission,7253.0,2.0,Manual,5204.0,,,,,,,
Owner_Type,7253.0,4.0,First,5952.0,,,,,,,
Mileage,7251.0,450.0,17.0 kmpl,207.0,,,,,,,
Engine,7207.0,150.0,1197 CC,732.0,,,,,,,
Power,7207.0,386.0,74 bhp,280.0,,,,,,,


### **Before we do EDA, let's separate Numerical and Categorical variables for easy analysis.**

In [71]:
cat_cols= data.select_dtypes(include=['object']).columns.tolist()
num_cols= data.select_dtypes(include= np.number).columns.tolist()
print("Categorical variables: ")
print(cat_cols)
print("Numerical variables: ")
print(num_cols)

Categorical variables: 
['Name', 'Location', 'Fuel_Type', 'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power', 'New_Price', 'Brand', 'Model', 'Type']
Numerical variables: 
['Year', 'Kilometers_Driven', 'Seats', 'Price', 'Car_Age']
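
`Mileage`, `Engine` and `Power` land in the categorical list only because their values carry unit suffixes (`998 CC`, `58.16 bhp`). If numeric versions are wanted, one possible approach is to extract the leading number with a regular expression. A sketch on toy values; the `_num` column names and the malformed entry are hypothetical:

```python
import pandas as pd

# Toy values: two well-formed entries per column, plus a missing and a malformed one.
toy = pd.DataFrame({
    "Engine": ["998 CC", "1582 CC", None],
    "Power": ["58.16 bhp", "126.2 bhp", "n/a bhp"],
})

for col in ["Engine", "Power"]:
    # Capture the leading integer or decimal number; rows without one become NaN.
    number = toy[col].str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    toy[col + "_num"] = pd.to_numeric(number, errors="coerce")

print(toy.dtypes)
```

Keeping the original columns alongside the numeric ones makes it easy to inspect which rows failed to parse.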


## Step-9: EDA Univariate Analysis

Analyzing/visualizing the dataset by taking one variable at a time:

Data visualization is essential; we must decide what charts to plot to better understand the data. Here, we visualize our data using **matplotlib** and **seaborn** libraries.

Matplotlib is a Python 2D plotting library used to draw basic charts.

Seaborn is a Python library built on top of Matplotlib that creates and styles statistical plots from Pandas and NumPy data in a few lines of code.

Univariate analysis can be done for both Categorical and Numerical variables.

Categorical variables can be visualized using a Count plot, Bar Chart, Pie plot etc.

Numerical Variables can be visualized using Histogram, Box plot, Density plot etc.

In our example, we have done a univariate analysis using a histogram and a box plot for the continuous variables.

In the example below, a histogram and a box plot are used to show the pattern of each variable, as some variables are skewed and have outliers.