# Used Car Sales

## Task description

Can you help us estimate the price we should list a car for? The team's estimates are always around 30% away from the price we know the car will sell for. We really want to be within 10% of that price.
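
To make the 10% target concrete, here is a minimal sketch of one way it could be measured: the share of estimates whose relative error is at most 10%. The helper name `share_within` and the example prices are hypothetical and not part of the dataset or the original analysis.

```python
import numpy as np

def share_within(actual, predicted, tolerance=0.10):
    """Return the fraction of predictions whose relative error is <= tolerance."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    relative_error = np.abs(predicted - actual) / actual
    return float(np.mean(relative_error <= tolerance))

# Hypothetical sale prices and estimates, for illustration only
actual = [10000, 12000, 8000, 15000]
estimates = [10500, 9000, 8700, 15200]

share = share_within(actual, estimates)
print(share)
```

In this made-up example, three of the four estimates fall within 10% of the sale price, so the share is 0.75.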

## Data Validation

Here are the first five rows of our data:

In [4]:
import pandas as pd

data = pd.read_csv('toyota.csv')
data.head()

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GT86 | 2016 | 16000 | Manual | 24089 | Petrol | 265 | 36.2 | 2.0 |
| 1 | GT86 | 2017 | 15995 | Manual | 18615 | Petrol | 145 | 36.2 | 2.0 |
| 2 | GT86 | 2015 | 13998 | Manual | 27469 | Petrol | 265 | 36.2 | 2.0 |
| 3 | GT86 | 2017 | 18998 | Manual | 14736 | Petrol | 150 | 36.2 | 2.0 |
| 4 | GT86 | 2017 | 17498 | Manual | 36284 | Petrol | 145 | 36.2 | 2.0 |


We were provided with the following description of the dataset:

**model** Character, the model of the car, 18 possible values

**year** Numeric, year of registration from 1998 to 2020

**price** Numeric, listed value of the car in GBP

**transmission** Character, one of "Manual", "Automatic", "Semi-Auto" or "Other"

**mileage** Numeric, listed mileage of the car at time of sale

**fuelType** Character, one of "Petrol", "Hybrid", "Diesel" or "Other"

**tax** Numeric, road tax in GBP. Calculated based on CO2 emissions or a fixed price depending on the age of the car.

**mpg** Numeric, miles per gallon as reported by manufacturer

**engineSize** Numeric, listed engine size, one of 16 possible values

As you can see below, the data types match the description and there are no missing values:

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6738 entries, 0 to 6737
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         6738 non-null   object 
 1   year          6738 non-null   int64  
 2   price         6738 non-null   int64  
 3   transmission  6738 non-null   object 
 4   mileage       6738 non-null   int64  
 5   fuelType      6738 non-null   object 
 6   tax           6738 non-null   int64  
 7   mpg           6738 non-null   float64
 8   engineSize    6738 non-null   float64
dtypes: float64(2), int64(4), object(3)
memory usage: 473.9+ KB
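
The same check can be made explicit rather than read off the `info()` output. The sketch below rebuilds three of the rows shown earlier by hand so that it runs on its own; with the real data, `sample` would simply be `data`. The list of numeric columns is taken from the dataset description.

```python
import pandas as pd

# Three rows copied from the head of the dataset, for a self-contained demo
sample = pd.DataFrame({
    "model": ["GT86", "GT86", "GT86"],
    "year": [2016, 2017, 2015],
    "price": [16000, 15995, 13998],
    "transmission": ["Manual", "Manual", "Manual"],
    "mileage": [24089, 18615, 27469],
    "fuelType": ["Petrol", "Petrol", "Petrol"],
    "tax": [265, 145, 265],
    "mpg": [36.2, 36.2, 36.2],
    "engineSize": [2.0, 2.0, 2.0],
})

# Count missing values per column
missing_per_column = sample.isna().sum()

# Columns the description calls "Numeric" should have a numeric dtype
numeric_columns = ["year", "price", "mileage", "tax", "mpg", "engineSize"]
all_numeric = all(
    pd.api.types.is_numeric_dtype(sample[column]) for column in numeric_columns
)

print(missing_per_column.sum(), all_numeric)
```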


Let's validate each column of the dataset.

**model** Character, the model of the car, 18 possible values

Indeed, we have 18 models:

In [6]:
print(data['model'].describe())
data['model'].unique()

count       6738
unique        18
top        Yaris
freq        2122
Name: model, dtype: object


array([' GT86', ' Corolla', ' RAV4', ' Yaris', ' Auris', ' Aygo', ' C-HR',
       ' Prius', ' Avensis', ' Verso', ' Hilux', ' PROACE VERSO',
       ' Land Cruiser', ' Supra', ' Camry', ' Verso-S', ' IQ',
       ' Urban Cruiser'], dtype=object)
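
Note that every model name in the array above starts with a space. The original analysis leaves the names as they are; if clean labels are wanted, one possible fix is to strip the whitespace, as in this small sketch on a hand-made series (with the real data, the series would be `data['model']`).

```python
import pandas as pd

# A few model names as they appear in the raw data, with a leading space
models = pd.Series([" GT86", " Corolla", " Yaris", " Yaris"], name="model")

# Remove leading and trailing whitespace from every value
cleaned = models.str.strip()

print(cleaned.unique())
```

Stripping does not change the number of distinct models as long as no two names differ only by whitespace.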

**year** Numeric, year of registration from 1998 to 2020

This is correct:

In [7]:
import numpy as np

np.sort(data['year'].unique())

array([1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
       2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019,
       2020], dtype=int64)
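
Instead of inspecting the sorted array by eye, the documented range can be asserted directly. This is a minimal sketch on a hand-made series; with the real data, `years` would be `data['year']`.

```python
import pandas as pd

# Hypothetical subset of registration years, for illustration only
years = pd.Series([2016, 2017, 2015, 1998, 2020], name="year")

# between() is inclusive on both ends by default
in_range = years.between(1998, 2020)

print(in_range.all())
```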

**price** Numeric, listed value of the car in GBP

Prices range from 850 to 59,995 GBP:

In [8]:
data['price'].describe()

count     6738.000000
mean     12522.391066
std       6345.017587
min        850.000000
25%       8290.000000
50%      10795.000000
75%      14995.000000
max      59995.000000
Name: price, dtype: float64

**transmission** Character, one of "Manual", "Automatic", "Semi-Auto" or "Other"

This is correct:

In [9]:
data['transmission'].unique()

array(['Manual', 'Automatic', 'Semi-Auto', 'Other'], dtype=object)

**mileage** Numeric, listed mileage of the car at time of sale

Mileage ranges from 2 to 174,419:

In [10]:
data['mileage'].describe()

count      6738.000000
mean      22857.413921
std       19125.464147
min           2.000000
25%        9446.000000
50%       18513.000000
75%       31063.750000
max      174419.000000
Name: mileage, dtype: float64

**fuelType** Character, one of "Petrol", "Hybrid", "Diesel" or "Other"

This is correct:

In [11]:
data['fuelType'].unique()

array(['Petrol', 'Other', 'Hybrid', 'Diesel'], dtype=object)
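
Both categorical checks (`transmission` and `fuelType`) can also be automated by comparing the observed values against the sets given in the dataset description. The sketch below uses a small hand-made frame; with the real data, `sample` would be `data`.

```python
import pandas as pd

# Allowed values, copied from the dataset description
allowed = {
    "transmission": {"Manual", "Automatic", "Semi-Auto", "Other"},
    "fuelType": {"Petrol", "Hybrid", "Diesel", "Other"},
}

# Hand-made frame holding the unique values observed above
sample = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Semi-Auto", "Other"],
    "fuelType": ["Petrol", "Other", "Hybrid", "Diesel"],
})

# For each column, list any observed value missing from the allowed set
unexpected = {
    column: sorted(set(sample[column].unique()) - values)
    for column, values in allowed.items()
}

print(unexpected)
```

An empty list for a column means every observed value is documented.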

**tax** Numeric, road tax in GBP. Calculated based on CO2 emissions or a fixed price depending on the age of the car.

Road tax ranges from 0 to 565 GBP:

In [13]:
data['tax'].describe()

count    6738.000000
mean       94.697240
std        73.880776
min         0.000000
25%         0.000000
50%       135.000000
75%       145.000000
max       565.000000
Name: tax, dtype: float64

**mpg** Numeric, miles per gallon as reported by manufacturer

Miles per gallon range from 2.8 to 235:

In [14]:
data['mpg'].describe()

count    6738.000000
mean       63.042223
std        15.836710
min         2.800000
25%        55.400000
50%        62.800000
75%        69.000000
max       235.000000
Name: mpg, dtype: float64

**engineSize** Numeric, listed engine size, one of 16 possible values

Indeed, we have 16 unique values:

In [17]:
print("Unique sizes #: " + str(len(data['engineSize'].unique())))
data['engineSize'].unique()

Unique sizes #: 16


array([2. , 1.8, 1.2, 1.6, 1.4, 2.5, 2.2, 1.5, 1. , 1.3, 0. , 2.4, 3. ,
       2.8, 4.2, 4.5])

However, one of the unique values is zero. Let's see which cars have an engine size listed as 0:

In [18]:
data[data['engineSize'] == 0]

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
| 2535 | Yaris | 2016 | 12300 | Manual | 6148 | Hybrid | 0 | 86.0 | 0.0 |
| 2545 | Yaris | 2016 | 11000 | Automatic | 39909 | Hybrid | 0 | 86.0 | 0.0 |
| 5126 | Aygo | 2019 | 9800 | Manual | 3635 | Petrol | 150 | 56.5 | 0.0 |
| 5233 | Aygo | 2019 | 8000 | Manual | 8531 | Petrol | 145 | 56.5 | 0.0 |
| 5257 | Aygo | 2019 | 8000 | Manual | 5354 | Petrol | 145 | 56.5 | 0.0 |
| 5960 | C-HR | 2017 | 14300 | Manual | 46571 | Petrol | 145 | 47.1 | 0.0 |


Since an engine size can't be zero, let's replace the zeros with the most common engine size for the same model and fuel type:

In [19]:
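# Replace each zero engine size with the mode for the same model and fuel type
for index in data[data['engineSize'] == 0].index:
    # Boolean mask selecting all cars with the same model and fuel type
    same_car = ((data['model'] == data.loc[index, 'model']) &
                (data['fuelType'] == data.loc[index, 'fuelType']))
    data.loc[index, 'engineSize'] = data.loc[same_car, 'engineSize'].mode().values[0]
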
print("Unique sizes #: " + str(len(data['engineSize'].unique())))
data['engineSize'].unique()

Unique sizes #: 15


array([2. , 1.8, 1.2, 1.6, 1.4, 2.5, 2.2, 1.5, 1. , 1.3, 2.4, 3. , 2.8,
       4.2, 4.5])
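
The row-by-row loop above can also be written without an explicit loop. The following is a sketch of one alternative, not the method used in this analysis: it treats zeros as missing, computes the most common size per model and fuel type with `groupby(...).transform`, and fills the gaps. Unlike the loop, it excludes the zeros themselves when computing the mode. The engine sizes in `sample` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups, each with one zero engine size
sample = pd.DataFrame({
    "model": ["Aygo", "Aygo", "Aygo", "Aygo", "Yaris", "Yaris", "Yaris"],
    "fuelType": ["Petrol"] * 4 + ["Hybrid"] * 3,
    "engineSize": [1.0, 1.0, 1.2, 0.0, 1.5, 1.5, 0.0],
})

# Treat zeros as missing so they cannot influence the mode
sizes = sample["engineSize"].replace(0, np.nan)

# Most common non-missing size within each (model, fuelType) group,
# broadcast back to the shape of the original column
group_mode = sizes.groupby([sample["model"], sample["fuelType"]]).transform(
    lambda s: s.mode().iloc[0] if s.notna().any() else np.nan
)

# Fill only the missing entries; valid sizes are left untouched
sample["engineSize"] = sizes.fillna(group_mode)

print(sample["engineSize"].tolist())
```

If a group contains only zeros, its entries stay missing in this sketch and would need separate handling.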

## Exploratory Analysis