We'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle (https://www.kaggle.com/orgesleka/used-cars-database/data). The aim of this project is to clean the data and analyze the included used car listings.

In [3]:
#Importing libraries
import pandas as pd
import numpy as np


In [None]:
#Parse the data
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [2]:
#Check the data
autos.info()
autos.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


As we can see, the raw data needs cleaning: some columns have fewer than 50,000 non-null values, so they contain missing data. The column names are in camelCase. Let's convert them to snake_case.

In [3]:
#Change column names:
names = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'picture', 'postal_code',
       'last_seen']
#Assign the new names directly; set_axis(..., inplace=True) no longer works in pandas 2.0 and later
autos.columns = names
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'picture', 'postal_code',
       'last_seen'],
      dtype='object')
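
The list of new names above was typed out by hand. As a rough sketch, most of it could be derived with a small regular-expression helper. `camel_to_snake` is a hypothetical name, not part of the notebook, and columns that were also reworded (for example yearOfRegistration to registration_year) would still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier such as 'dateCrawled' to snake_case."""
    # Put an underscore before each uppercase letter that follows
    # a lowercase letter or a digit, then lowercase everything.
    with_underscores = re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name)
    return with_underscores.lower()

# A few of the original column names from the info() output
original = ['dateCrawled', 'offerType', 'vehicleType', 'powerPS',
            'fuelType', 'postalCode', 'lastSeen']
converted = [camel_to_snake(c) for c in original]
print(converted)
```

The lookbehind only fires after a lowercase letter or digit, so a run of capitals such as the "PS" in powerPS stays together and becomes power_ps.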

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. 

In [4]:
#Check the data
autos.describe(include = 'object')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,gearbox,model,odometer_km,fuel_type,brand,unrepaired_damage,ad_created,last_seen
count,50000,50000,50000,50000,50000,50000,44905,47320,47242,50000,45518,50000,40171,50000,50000
unique,48213,38754,2,2,2357,2,8,2,245,13,7,40,2,76,39481
top,2016-04-04 16:40:33,Ford_Fiesta,privat,Angebot,$0,test,limousine,manuell,golf,"150,000km",benzin,volkswagen,nein,2016-04-03 00:00:00,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,36993,4024,32424,30107,10687,35232,1946,8


Let's convert the text columns (price and odometer_km) to integers.

In [5]:
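#Convert text to integer
#regex=False treats '$' as a literal character, not as the regex end-of-string anchor
autos['price'] = (autos['price'].str.replace('$', '', regex=False)
                                .str.replace(',', '', regex=False)
                                .astype(int)
                 )

autos['odometer_km'] = (autos['odometer_km'].str.replace(',', '', regex=False)
                                            .str.replace('km', '', regex=False)
                                            .astype(int)
                       )
#These two columns are already int64, so the casts leave them unchanged
autos['registration_year'] = autos['registration_year'].astype(int)
autos['registration_month'] = autos['registration_month'].astype(int)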

autos.head(3)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,picture,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
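
The same cleaning step can be tried on a tiny frame without loading autos.csv. This is only a sketch: `sample` and `strip_to_int` are hypothetical names introduced here, and the rows are made-up values in the same format as the raw price and odometer strings.

```python
import pandas as pd

# Hypothetical rows in the same format as the raw columns
sample = pd.DataFrame({
    'price': ['$5,000', '$8,500', '$0'],
    'odometer_km': ['150,000km', '70,000km', '5,000km'],
})

def strip_to_int(series, tokens):
    """Remove each literal token from a string Series, then cast to int."""
    for token in tokens:
        # regex=False keeps characters such as '$' literal
        series = series.str.replace(token, '', regex=False)
    return series.astype(int)

sample['price'] = strip_to_int(sample['price'], ['$', ','])
sample['odometer_km'] = strip_to_int(sample['odometer_km'], [',', 'km'])
print(sample.dtypes)
```

Wrapping the replacements in one helper keeps the two columns consistent and makes it harder to forget `regex=False` on one of the calls.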


We'll start by analyzing the odometer_km and price columns, using their minimum and maximum values to spot anything unrealistically high or low (outliers) that we want to remove.


In [6]:
#Check odometer and price
autos['odometer_km'].unique().shape
autos['odometer_km'].describe()
autos['odometer_km'].value_counts().sort_index(ascending = True)

autos['price'].unique().shape
autos['price'].describe()
autos['price'].value_counts().sort_index(ascending = False).head(100)

#Drop the $0 listings and anything priced above $350,000
autos = autos[autos["price"].between(1,350000)]

autos['price'].value_counts().sort_index(ascending = True).head(100)

autos.describe()

Unnamed: 0,price,registration_year,power_ps,odometer_km,registration_month,picture,postal_code
count,48565.0,48565.0,48565.0,48565.0,48565.0,48565.0,48565.0
mean,5888.935591,2004.755421,117.197158,125770.101925,5.782251,0.0,50975.745207
std,9059.854754,88.643887,200.649618,39788.636804,3.685595,0.0,25746.968398
min,1.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1200.0,1999.0,71.0,125000.0,3.0,0.0,30657.0
50%,3000.0,2004.0,107.0,150000.0,6.0,0.0,49716.0
75%,7490.0,2008.0,150.0,150000.0,9.0,0.0,71665.0
max,350000.0,9999.0,17700.0,150000.0,12.0,0.0,99998.0
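
Before committing to a cutoff, it can help to check how many rows a range filter would discard. The sketch below does that on a made-up price Series; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical prices, including a free listing and two implausibly high ones
prices = pd.Series([0, 1, 500, 3000, 7490, 350000, 999990, 99999999],
                   name='price')

# between() is inclusive on both ends by default
in_range = prices.between(1, 350000)
kept = prices[in_range]

# The mean of a boolean mask is the share of True values
removed_share = 1 - in_range.mean()
print(f"kept {len(kept)} of {len(prices)} rows, removed {removed_share:.1%}")
```

If the removed share were large, that would be a hint that the bounds are too aggressive rather than that the data is unusually dirty.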


Let's now move on to the date columns and understand the date range the data covers. There are 5 columns that should represent date values: date_crawled, ad_created, last_seen, registration_year and registration_month.

In [7]:
autos.describe(include = 'object')
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)

autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
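
In a notebook cell only the last expression is displayed, so the three value_counts() calls above produce no visible output. A minimal sketch of inspecting one of them, using the three date_crawled values shown in the head() output earlier (the variable names are assumptions made for this example):

```python
import pandas as pd

# The three date_crawled timestamps from the first rows of the dataset
crawled = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
], name='date_crawled')

# Share of listings per calendar day, in chronological order
day_share = (crawled.str[:10]
                    .value_counts(normalize=True, dropna=False)
                    .sort_index())
print(day_share)

# Parsing the strings gives the exact range the sample covers
parsed = pd.to_datetime(crawled)
print(parsed.min(), parsed.max())
```

Sorting by index rather than by frequency makes gaps and ramp-ups in the crawl period easier to see.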

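Because a car can't be first registered after its listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Let's remove the implausible years. Note that the filter below keeps 1920 through 2018, so listings registered in 2017 and 2018 are still present in the output; an upper bound of 2016 would be needed to drop them as well.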

In [8]:
autos = autos[autos["registration_year"].between(1920,2018)]
autos['registration_year'].value_counts().sort_index(ascending = True).head(100)

1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       2
1952       1
1953       1
1954       2
1955       2
1956       4
1957       2
1958       4
1959       6
1960      23
1961       6
1962       4
1963       8
1964      12
1965      17
1966      22
1967      26
1968      26
1969      19
        ... 
1989     174
1990     347
1991     339
1992     370
1993     425
1994     629
1995    1227
1996    1373
1997    1951
1998    2363
1999    2897
2000    3156
2001    2636
2002    2486
2003    2699
2004    2703
2005    2936
2006    2670
2007    2277
2008    2215
2009    2085
2010    1589
2011    1623
2012    1310
2013     803
2014     663
2015     392
2016    1220
2017    1392
2018     470
Name: registration_year, Length: 79, dtype: int64
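
In the output above, 2017 and 2018 together account for 1,862 listings, so the choice of upper bound matters. The sketch below compares the two bounds on a small made-up Series; the years are illustrative only.

```python
import pandas as pd

# Hypothetical registration years, including obvious errors
years = pd.Series([1000, 1927, 1999, 2004, 2016, 2017, 2018, 9999],
                  name='registration_year')

kept_2018 = years[years.between(1920, 2018)]   # bound used in the cell above
kept_2016 = years[years.between(1920, 2016)]   # bound implied by the crawl year

# Rows that survive only because of the looser bound
extra_rows = len(kept_2018) - len(kept_2016)
print(f"{extra_rows} rows have a registration year after 2016")
```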

When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column.

In [9]:
autos['brand'].value_counts()

volkswagen        10331
bmw                5274
opel               5272
mercedes_benz      4650
audi               4168
ford               3382
renault            2324
peugeot            1430
fiat               1262
seat                919
skoda               780
nissan              741
mazda               739
smart               694
citroen             685
toyota              611
hyundai             483
sonstige_autos      467
volvo               439
mini                418
mitsubishi          397
honda               387
kia                 345
alfa_romeo          320
porsche             287
suzuki              286
chevrolet           275
chrysler            169
dacia               129
daihatsu            122
jeep                107
subaru              102
land_rover           99
saab                 79
daewoo               76
jaguar               74
trabant              66
rover                65
lancia               55
lada                 29
Name: brand, dtype: int64

For the top 10 brands, let's use aggregation to find the average mileage of those cars and see whether there's any visible link with mean price.

In [4]:
#value_counts() already sorts in descending order
top10 = autos['brand'].value_counts().head(10)
print(top10)
autos_mile = {}

#Mean odometer reading (in km) for each of the top 10 brands
for b in top10.index:
    rows = autos[autos['brand'] == b]
    mean_km = rows['odometer_km'].mean()
    autos_mile[b] = mean_km

print('.....')
print(autos_mile)

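NameError: name 'autos' is not defined

This error most likely comes from running the cell in a fresh kernel (note the In [4] counter) before autos was loaded; rerunning the notebook from the top defines autos first.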
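
The loop above only collects mean mileage. One possible way to put mean mileage and mean price side by side is a groupby, sketched here on a small made-up frame; `listings`, `top_brands` and `summary` are hypothetical names and the numbers are not from the dataset.

```python
import pandas as pd

# Hypothetical listings with the cleaned column names
listings = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'bmw', 'bmw', 'opel', 'opel', 'lada'],
    'price': [5000, 7000, 9000, 11000, 3000, 2000, 1500],
    'odometer_km': [150000, 125000, 150000, 100000, 150000, 150000, 90000],
})

# Keep the most frequent brands (top 3 here, top 10 in the notebook)
top_brands = listings['brand'].value_counts().head(3).index

# One row per brand with both means, most expensive brand first
summary = (listings[listings['brand'].isin(top_brands)]
           .groupby('brand')[['odometer_km', 'price']]
           .mean()
           .sort_values('price', ascending=False))
print(summary)
```

Having both means in one table makes it straightforward to check whether brands with higher mean prices also show noticeably different mean mileage.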