##### Data Validation
This dataset contains 6738 rows and 9 columns. I was given the following description of the data:
- **model**: Character, the model of the car, 18 possible values
- **year**: Numeric, year of registration from 1998 to 2020
- **price**: Numeric, listed value of the car in GBP
- **transmission**: Character, one of "Manual", "Automatic", "Semi-Auto" or "Other"
- **mileage**: Numeric, listed mileage of the car at time of sale
- **fuelType**: Character, one of "Petrol", "Hybrid", "Diesel" or "Other"
- **tax**: Numeric, road tax in GBP. Calculated based on CO2 emissions or a fixed price depending on the age of the car.
- **mpg**: Numeric, miles per gallon as reported by manufacturer.
- **engineSize**: Numeric, listed engine size, one of 16 possible values.


I made one minor adjustment to **model**, stripping the leading whitespace from each category. All other columns were checked.
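The checks described above can also be expressed as one reusable function. The following is a minimal sketch, not the method used in this notebook: the `sample` frame, the `ALLOWED` mapping and the `validate` helper are hypothetical names introduced for illustration, and only the allowed values and the year range come from the description above.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real file.
sample = pd.DataFrame({
    "year": [2016, 1998, 2020],
    "transmission": ["Manual", "Automatic", "Other"],
    "fuelType": ["Petrol", "Hybrid", "Diesel"],
})

# Allowed categories, copied from the data description.
ALLOWED = {
    "transmission": {"Manual", "Automatic", "Semi-Auto", "Other"},
    "fuelType": {"Petrol", "Hybrid", "Diesel", "Other"},
}

def validate(df):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for col, allowed in ALLOWED.items():
        unexpected = set(df[col].unique()) - allowed
        if unexpected:
            problems.append(f"{col}: unexpected values {sorted(unexpected)}")
    # Registration year must fall inside the documented range.
    if not df["year"].between(1998, 2020).all():
        problems.append("year: values outside 1998-2020")
    return problems

problems = validate(sample)
print(problems)
```

Returning a list of messages rather than raising on the first failure makes it possible to see every mismatch in a single run.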

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [52]:
data = pd.read_csv('Data/toyota.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6738 entries, 0 to 6737
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         6738 non-null   object 
 1   year          6738 non-null   int64  
 2   price         6738 non-null   int64  
 3   transmission  6738 non-null   object 
 4   mileage       6738 non-null   int64  
 5   fuelType      6738 non-null   object 
 6   tax           6738 non-null   int64  
 7   mpg           6738 non-null   float64
 8   engineSize    6738 non-null   float64
dtypes: float64(2), int64(4), object(3)
memory usage: 473.9+ KB


In [53]:
# Check that model has the 18 expected values
print(data['model'].nunique())
print(data['model'].unique())
data['model'] = data['model'].str.strip()  # Strip surrounding whitespace from model names

18
[' GT86' ' Corolla' ' RAV4' ' Yaris' ' Auris' ' Aygo' ' C-HR' ' Prius'
 ' Avensis' ' Verso' ' Hilux' ' PROACE VERSO' ' Land Cruiser' ' Supra'
 ' Camry' ' Verso-S' ' IQ' ' Urban Cruiser']
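As a small illustration of the whitespace clean-up, the sketch below applies `str.strip` to three of the model names printed above. The `models`, `cleaned` and `had_padding` names are hypothetical. The point is that stripping removes only the surrounding whitespace, so a multi-word name such as "Land Cruiser" keeps its interior space.

```python
import pandas as pd

# Three model names copied from the output above, leading space included.
models = pd.Series([" GT86", " Corolla", " Land Cruiser"])

# str.strip() with no argument removes leading and trailing whitespace only.
cleaned = models.str.strip()

# Count how many entries were actually changed by the clean-up.
had_padding = int((models != cleaned).sum())

print(cleaned.tolist())  # ['GT86', 'Corolla', 'Land Cruiser']
print(had_padding)       # 3
```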


In [54]:
# Check that year spans 1998 to 2020
print(np.sort(data['year'].unique()))
data['year'] = pd.to_datetime(data['year'], format='%Y')  # Convert year from int to datetime

[1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
 2012 2013 2014 2015 2016 2017 2018 2019 2020]
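To show what the conversion to datetime produces, here is a minimal sketch on three hypothetical year values. It casts to string before parsing, which is an assumption made here to keep the `%Y` format unambiguous; the cell above passes the integers directly.

```python
import pandas as pd

# Hypothetical sample of registration years.
years = pd.Series([1998, 2016, 2020])

# Parse each year as a date; the result is 1 January of that year.
as_dates = pd.to_datetime(years.astype(str), format="%Y")

# The integer year stays recoverable through the .dt accessor.
recovered = as_dates.dt.year.tolist()

print(as_dates.iloc[0])
print(recovered)
```

Because every value becomes 1 January of its year, summary statistics on the column are reported as timestamps rather than as plain integers.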


In [56]:
#Check for transmission values
print(data['transmission'].unique())

['Manual' 'Automatic' 'Semi-Auto' 'Other']


In [57]:
#Check for fuelType values
print(data['fuelType'].unique())

['Petrol' 'Other' 'Hybrid' 'Diesel']


In [58]:
#Check for engineSize 16 possible values
print(data['engineSize'].nunique())
print(data['engineSize'].unique())

16
[2.  1.8 1.2 1.6 1.4 2.5 2.2 1.5 1.  1.3 0.  2.4 3.  2.8 4.2 4.5]


In [55]:
#Check for negative values in data
data.describe()

                                 year         price       mileage        tax        mpg  engineSize
count                            6738        6738.0        6738.0     6738.0     6738.0      6738.0
mean   2016-09-30 16:27:21.317898752  12522.391066  22857.413921   94.69724  63.042223    1.471297
min              1998-01-01 00:00:00         850.0           2.0        0.0        2.8         0.0
25%              2016-01-01 00:00:00        8290.0        9446.0        0.0       55.4         1.0
50%              2017-01-01 00:00:00       10795.0       18513.0      135.0       62.8         1.5
75%              2018-01-01 00:00:00       14995.0      31063.75      145.0       69.0         1.8
max              2020-01-01 00:00:00       59995.0      174419.0      565.0      235.0         4.5
std                              NaN   6345.017587  19125.464147  73.880776   15.83671    0.436159
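Reading the `min` row by eye works, but the negative-value check can also be made explicit. The sketch below is one possible approach, not what the notebook ran: the `sample` frame is hypothetical, built from the min, 50% and max figures in the summary above rather than from real rows, and `negatives` and `has_negative` are illustrative names.

```python
import pandas as pd

# Hypothetical frame assembled from the min / 50% / max summary values.
sample = pd.DataFrame({
    "price": [850, 10795, 59995],
    "mileage": [2, 18513, 174419],
    "tax": [0, 135, 565],
    "mpg": [2.8, 62.8, 235.0],
    "engineSize": [0.0, 1.5, 4.5],
})

# Count negative entries per numeric column.
negatives = (sample.select_dtypes("number") < 0).sum()

# True if any numeric column contains a negative value.
has_negative = bool(negatives.any())

print(negatives)
print(has_negative)
```

On the full dataset the same two lines could be applied to `data` directly, since `select_dtypes("number")` skips the datetime and text columns.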
