# eBay Car Sales Data

This project uses the `autos.csv` data file, which contains information on used car sales from Germany's eBay website. Although the data was originally posted on Kaggle, it has been modified to be smaller (so it runs faster) and dirtier (so it more closely resembles real-world datasets).

The following information is included in the file:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `odometer` - How many kilometers the car has been driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - If the car has a damage which is not yet repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeen` - When the crawler last saw this ad online.

In [1]:
import numpy as np
import pandas as pd

# use Latin-1 encoding, as default (UTF-8) was not successful
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [2]:
# review the dataframe created
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


In [3]:
# use the info and head methods to review data
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


A few items to note from this initial review of the data:
- several columns have missing data
- most of the text in the data is in German, not English
- much of the data is stored as objects, rather than numbers
- the column names use camelCase, rather than snake_case
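The first observation (missing data) can be quantified per column. The following is a minimal sketch on a small made-up frame, not on the real `autos` data; the sample values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up rows, not the real dataset)
df = pd.DataFrame({
    'vehicle_type': ['bus', np.nan, 'kombi', np.nan],
    'brand': ['peugeot', 'bmw', 'seat', 'ford'],
})

# Number of missing values in each column
missing = df.isnull().sum()

# Fraction of missing values in each column
missing_share = df.isnull().mean()

print(missing)
print(missing_share)
```

Applied to `autos`, the same two calls would show which columns need attention before any analysis.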

In [4]:
# print the column names
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
# revise the column headers, use snake case, clarify labels
cols = list(autos.columns)
cols[0] = 'date_crawled'
cols[3] = 'offer_type'
cols[5] = 'ab_test'
cols[6] = 'vehicle_type'
cols[7] = 'registration_year'
cols[9] = 'power_ps'
cols[12] = 'registration_month'
cols[13] = 'fuel_type'
cols[15] = 'unrepaired_damage'
cols[16] = 'ad_created'
cols[17] = 'number_pictures'
cols[18] = 'postal_code'
cols[19] = 'last_seen'
autos.columns = cols

In [6]:
# review the results
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Updating the column names to snake_case keeps them consistent with the usual Python naming convention. Clarifying the labels will also make the columns easier to work with going forward.
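An alternative to assigning new names by position is to rename by old name, which does not depend on the column order. The sketch below uses a small empty frame and a hypothetical `new_names` mapping that covers only a few of the columns.

```python
import pandas as pd

# Stand-in frame with a few of the original camelCase headers
df = pd.DataFrame(columns=['dateCrawled', 'offerType', 'yearOfRegistration', 'price'])

# Hypothetical mapping from old header to new header;
# columns that are not listed keep their current name
new_names = {
    'dateCrawled': 'date_crawled',
    'offerType': 'offer_type',
    'yearOfRegistration': 'registration_year',
}

df = df.rename(columns=new_names)
print(list(df.columns))
```

Renaming by name fails less silently than renaming by index if the file ever gains or loses a column.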

In [7]:
# get descriptive statistics
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-22 09:51:06,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


## Initial Data Review

From the above, it appears that there are a number of columns that are predominantly one value. These columns can be dropped, as they don't appear to have any useful information:
- `seller`
- `offer_type`
- `number_pictures`

The following columns appear to be numeric columns stored as text and should be cleaned:
- `price`
- `odometer`

Further investigation is needed on the following columns:
- `vehicle_type` - there appear to be eight possible values?
- `registration_year` - there appear to be some bad values (too small or too large to be a year)
- `power_ps` - there is at least one very large value - are there outliers?
- `registration_month` - there should only be integers from 1 to 12
- `fuel_type` - there appear to be seven possible values?
- `unrepaired_damage` - is this a yes/no field?
- `date_crawled`, `ad_created`, and `last_seen` - do we need to split these date columns to work with this data more easily?
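The check for columns that are predominantly one value can also be done programmatically. This is a rough sketch on made-up data; the helper name `top_value_share`, the sample values, and the 0.9 threshold are all assumptions for illustration.

```python
import pandas as pd

# Made-up data: 'seller' is almost constant, 'brand' is evenly spread
df = pd.DataFrame({
    'seller': ['privat'] * 19 + ['gewerblich'],
    'brand': ['bmw', 'audi', 'ford', 'opel', 'fiat'] * 4,
})

def top_value_share(frame):
    """Return, per column, the fraction of rows taken by the most common value."""
    return pd.Series({
        col: frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in frame.columns
    })

shares = top_value_share(df)

# Columns where a single value covers more than 90% of the rows
dominated = shares[shares > 0.9].index.tolist()
print(dominated)
```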

In [8]:
# drop seller, offer_type, and number_pictures
# note: drop returns a new DataFrame; without assignment, autos itself is unchanged
autos.drop(['seller', 'offer_type', 'number_pictures'], axis=1)

Unnamed: 0,date_crawled,name,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,15749,2016-04-06 10:46:35


In [9]:
# clean the price column
autos['price'] = (autos['price']
                  .str.replace(',', '', regex=False)
                  .str.replace('$', '', regex=False)  # literal '$', not the regex end-of-string anchor
                  .astype(float)
                  )
autos['price'].head()

0    5000.0
1    8500.0
2    8990.0
3    4350.0
4    1350.0
Name: price, dtype: float64

In [10]:
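# clean the odometer column
autos['odometer'] = (autos['odometer']
                     .str.replace(',', '', regex=False)
                     .str.replace('km', '', regex=False)
                     .astype(float)
                     )
# store the cleaned values under a name that includes the unit
autos['odometer_km'] = autos['odometer']
# note: drop returns a new DataFrame; the result is not assigned here,
# so the original odometer column stays in autos
autos.drop(['odometer'], axis=1)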
autos['odometer_km'].head()

0    150000.0
1    150000.0
2     70000.0
3     70000.0
4    150000.0
Name: odometer_km, dtype: float64
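One detail worth keeping in mind: `DataFrame.drop` returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below shows this on a two-row stand-in frame (made-up values), along with `rename` as one way to avoid carrying both `odometer` and `odometer_km`.

```python
import pandas as pd

# Two-row stand-in frame (made-up values)
df = pd.DataFrame({'odometer': [150000.0, 70000.0], 'seller': ['privat', 'privat']})

# Calling drop without assigning the result leaves df unchanged
df.drop(['seller'], axis=1)
still_there = 'seller' in df.columns

# Assigning the result keeps the change; rename avoids a duplicate column
cleaned = df.drop(columns=['seller']).rename(columns={'odometer': 'odometer_km'})
print(still_there, list(cleaned.columns))
```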

In [11]:
# review vehicle_type categories for reasonableness
autos['vehicle_type'].value_counts()

limousine     12859
kleinwagen    10822
kombi          9127
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64

`vehicle_type` looks to be okay. There do not appear to be multiple categories that should be the same or misspellings.

In [31]:
# review registration_year for possibly incorrect values
autos.loc[(autos['registration_year'] < 1900) | (autos['registration_year'] > 2018),
         'registration_year']

453      4500
4164     5000
4549     4100
5763     2019
6308     9996
8012     9999
8360     6200
10556    1800
13559    9000
14341    9999
22316    1000
22799    5000
24511    1111
24519    5000
25003    8888
27578    2800
27618    5911
32585    1800
33950    9999
35238    1500
38076    9999
38342    2019
42079    4800
49153    5000
49185    2019
49283    1001
49910    9000
Name: registration_year, dtype: int64

These values are clearly incorrect. Some cars are listed as first registered in 2019, which cannot be right because it is not yet 2019. Others appear to be dummy data or typos (was 2015 entered as 15 and then converted to 1500?). These will need to be corrected.

In [13]:
# review power_ps for values that appear to be unreasonably large
autos.loc[(autos['power_ps'] > 10000), 'power_ps']

16743    14009
22592    15001
35039    16312
36421    17700
45671    15016
46986    16011
Name: power_ps, dtype: int64

These six values appear to be unreasonably large compared to the rest of the data.  These may be incorrect values and need to be corrected.

In [14]:
# review registration_month values for reasonableness
autos['registration_month'].value_counts()

0     5075
3     5071
6     4368
5     4107
4     4102
7     3949
10    3651
12    3447
9     3389
11    3360
1     3282
8     3191
2     3008
Name: registration_month, dtype: int64

Since these are months, we should only see values from 1 to 12 inclusive.  The zeros need to be corrected.
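One possible treatment, which is an assumption and not something this notebook goes on to do, is to treat a month of 0 as "unknown" and mark it as missing. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up sample of registration months, including invalid zeros
months = pd.Series([0, 3, 6, 0, 12], name='registration_month')

# Keep values in 1..12 and turn everything else (here: 0) into NaN
cleaned = months.where(months.between(1, 12), np.nan)

print(cleaned.isna().sum())
```

Marking the zeros as missing keeps the rows available for analyses that do not depend on the month.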

In [15]:
# review fuel_type categories for reasonableness
autos['fuel_type'].value_counts()

benzin     30107
diesel     14567
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64

`fuel_type` looks to be okay. There do not appear to be multiple categories that should be the same or misspellings.

In [16]:
# review unrepaired_damage categories for reasonableness
autos['unrepaired_damage'].value_counts()

nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64

`unrepaired_damage` appears to be a yes/no (ja/nein) field.  This is okay.

## Additional Data Cleaning - price and odometer_km

Let's further explore the `price` and `odometer_km` columns that we converted to numeric values above.

In [17]:
# price - determine number of unique values
autos['price'].unique().shape

(2357,)

In [18]:
# price - view descriptive statistics
autos['price'].describe(percentiles=[.01,.25,.50,.75,.99])

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
1%       0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
99%      3.590000e+04
max      1.000000e+08
Name: price, dtype: float64

In [19]:
# price - view highest values
autos['price'].value_counts().sort_index().tail(15)

265000.0      1
295000.0      1
299000.0      1
345000.0      1
350000.0      1
999990.0      1
999999.0      2
1234566.0     1
1300000.0     1
3890000.0     1
10000000.0    1
11111111.0    2
12345678.0    3
27322222.0    1
99999999.0    1
Name: price, dtype: int64

In [20]:
# price - view lowest values
autos['price'].value_counts().sort_index().head(30)

0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7
11.0       2
12.0       3
13.0       2
14.0       1
15.0       2
17.0       3
18.0       1
20.0       4
25.0       5
29.0       1
30.0       7
35.0       1
40.0       6
45.0       4
47.0       1
49.0       4
50.0      49
55.0       2
59.0       1
60.0       9
65.0       5
66.0       1
Name: price, dtype: int64

The mean price is about 9,840 and the standard deviation is about 481,104, so a handful of extreme values dominate the spread. The values above 350,000 do not look valid (e.g. 1234566 or 11111111). On the low end, considering that this is an auction site, the low values are not necessarily incorrect. I will remove the 14 rows with prices exceeding 350,000.

In [21]:
# remove rows with bad data in price
autos = autos.loc[(autos['price'] <= 350000), :]

In [22]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,...,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_pictures,postal_code,last_seen,odometer_km
count,49986,49986,49986,49986,49986.0,49986,44894,49986.0,47310,49986.0,...,49986.0,49986.0,45509,49986,40163,49986,49986.0,49986.0,49986,49986.0
unique,48200,38743,2,2,,2,8,,2,,...,,,7,40,2,76,,,39472,
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,...,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27,
freq,3,78,49985,49985,,25750,12854,,36985,,...,,,30100,10684,35225,1946,,,8,
mean,,,,,5721.525167,,,2005.075721,,116.341196,...,125736.506222,5.723723,,,,,0.0,50812.804225,,125736.506222
std,,,,,8983.61782,,,105.727161,,209.218012,...,40038.133399,3.711839,,,,,0.0,25777.404967,,40038.133399
min,,,,,0.0,,,1000.0,,0.0,...,5000.0,0.0,,,,,0.0,1067.0,,5000.0
25%,,,,,1100.0,,,1999.0,,70.0,...,125000.0,3.0,,,,,0.0,30451.0,,125000.0
50%,,,,,2950.0,,,2003.0,,105.0,...,150000.0,6.0,,,,,0.0,49571.0,,150000.0
75%,,,,,7200.0,,,2008.0,,150.0,...,150000.0,9.0,,,,,0.0,71522.0,,150000.0


In [23]:
# odometer_km - list the unique values
autos['odometer_km'].unique()

array([150000.,  70000.,  50000.,  80000.,  10000.,  30000., 125000.,
        90000.,  20000.,  60000.,   5000., 100000.,  40000.])

In [24]:
# odometer_km - view descriptive statistics
autos['odometer_km'].describe(percentiles=[.01,.25,.50,.75,.99])

count     49986.000000
mean     125736.506222
std       40038.133399
min        5000.000000
1%         5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
99%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [25]:
# odometer_km - view values
autos['odometer_km'].value_counts()

150000.0    32416
125000.0     5169
100000.0     2168
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1025
5000.0        966
40000.0       818
30000.0       789
20000.0       784
10000.0       264
Name: odometer_km, dtype: int64

These all appear to be valid values. No additional removal is needed.

## Additional Data Cleaning - dates

There are three columns with date information that are currently stored as strings, as identified above: `date_crawled`, `last_seen`, and `ad_created`. These columns all seem to be in the format YYYY-MM-DD HH:MM:SS.
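Given that format, the strings could also be parsed into real timestamps with `pd.to_datetime`. The sketch below uses two sample values taken from the preview above; the cells that follow instead slice the first ten characters of each string, which is enough for counting by day.

```python
import pandas as pd

# Two sample values in the YYYY-MM-DD HH:MM:SS format
raw = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56'], name='date_crawled')

# Parse with an explicit format string so malformed values raise an error
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S')

# Share of rows per calendar day, comparable to slicing the string with .str[:10]
day_share = parsed.dt.strftime('%Y-%m-%d').value_counts(normalize=True).sort_index()
print(day_share)
```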

In [26]:
# date_crawled
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025387
2016-03-06    0.013944
2016-03-07    0.035970
2016-03-08    0.033269
2016-03-09    0.033209
2016-03-10    0.032129
2016-03-11    0.032489
2016-03-12    0.036770
2016-03-13    0.015564
2016-03-14    0.036630
2016-03-15    0.033990
2016-03-16    0.029508
2016-03-17    0.031509
2016-03-18    0.013064
2016-03-19    0.034910
2016-03-20    0.037831
2016-03-21    0.037490
2016-03-22    0.032909
2016-03-23    0.032389
2016-03-24    0.029108
2016-03-25    0.031749
2016-03-26    0.032489
2016-03-27    0.031049
2016-03-28    0.034850
2016-03-29    0.034150
2016-03-30    0.033629
2016-03-31    0.031909
2016-04-01    0.033809
2016-04-02    0.035410
2016-04-03    0.038691
2016-04-04    0.036490
2016-04-05    0.013104
2016-04-06    0.003181
2016-04-07    0.001420
Name: date_crawled, dtype: float64

It appears that `date_crawled` ranges from March 5, 2016 to April 7, 2016, though the frequency tails off for the last couple days in April in the data.

In [27]:
# ad_created
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
2015-12-30    0.000020
2016-01-03    0.000020
2016-01-07    0.000020
2016-01-10    0.000040
2016-01-13    0.000020
2016-01-14    0.000020
2016-01-16    0.000020
2016-01-22    0.000020
2016-01-27    0.000060
2016-01-29    0.000020
2016-02-01    0.000020
2016-02-02    0.000040
2016-02-05    0.000040
2016-02-07    0.000020
2016-02-08    0.000020
2016-02-09    0.000040
2016-02-11    0.000020
2016-02-12    0.000060
2016-02-14    0.000040
2016-02-16    0.000020
2016-02-17    0.000020
2016-02-18    0.000040
2016-02-19    0.000060
2016-02-20    0.000040
2016-02-21    0.000060
                ...   
2016-03-09    0.033229
2016-03-10    0.031869
2016-03-11    0.032789
2016-03-12    0.036610
2016-03-13    0.016925
2016-03-14    0.035230
2016-03-15    0.033749
2016-03-16    0.030008
2016-03-17    0.031189
2016-03-18    0.013724
2016-03-19    0.033849
2016-03-20    0.037871
2016-03-21 

It appears that `ad_created` ranges from June 11, 2015 to April 7, 2016. Very few listings were created in 2015 or early 2016; most were created in March and April 2016. This suggests that older listings rarely remain online and that the data is dominated by recently created ads.

In [28]:
# last_seen
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005362
2016-03-08    0.007582
2016-03-09    0.009843
2016-03-10    0.010763
2016-03-11    0.012524
2016-03-12    0.023807
2016-03-13    0.008983
2016-03-14    0.012804
2016-03-15    0.015884
2016-03-16    0.016445
2016-03-17    0.027928
2016-03-18    0.007422
2016-03-19    0.015744
2016-03-20    0.020706
2016-03-21    0.020726
2016-03-22    0.021586
2016-03-23    0.018585
2016-03-24    0.019565
2016-03-25    0.019205
2016-03-26    0.016965
2016-03-27    0.016024
2016-03-28    0.020846
2016-03-29    0.022326
2016-03-30    0.024847
2016-03-31    0.023827
2016-04-01    0.023106
2016-04-02    0.024887
2016-04-03    0.025367
2016-04-04    0.024627
2016-04-05    0.124275
2016-04-06    0.220982
2016-04-07    0.130957
Name: last_seen, dtype: float64

`last_seen` covers the same range of dates as `date_crawled`. However, its distribution is heavily skewed toward the end of the range: roughly 48% of the listings were last seen in the final three days (April 5 to 7).

## Additional Data Cleaning - registration

Let's also look at registration and the distribution of years. Earlier we identified that there were some bad values in this field.

In [29]:
autos['registration_year'].describe()

count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

We now know that this data is from 2016, so there should not be any values greater than 2016 in our data - those that are must be inaccurate values.  We should look closer at the minimum valid value.

In [33]:
# review the lowest values in registration_year
autos.loc[(autos['registration_year'] < 1930), 'registration_year'].unique()

array([1910, 1800, 1927, 1929, 1000, 1111, 1500, 1001])

The values of 1910, 1927, and 1929 could be legitimate values.  Let's remove everything else (so we can stay with 1900 as a lower bound).  We will also remove all values greater than 2016.

In [34]:
autos = autos.loc[autos['registration_year'].between(1900,2016), :]

In [35]:
# review after removing outliers
autos['registration_year'].value_counts(normalize=True).sort_index()

1910    0.000187
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
1937    0.000083
1938    0.000021
1939    0.000021
1941    0.000042
1943    0.000021
1948    0.000021
1950    0.000062
1951    0.000042
1952    0.000021
1953    0.000021
1954    0.000042
1955    0.000042
1956    0.000104
1957    0.000042
1958    0.000083
1959    0.000146
1960    0.000687
1961    0.000125
1962    0.000083
1963    0.000187
1964    0.000250
1965    0.000354
1966    0.000458
1967    0.000562
1968    0.000541
          ...   
1987    0.001562
1988    0.002957
1989    0.003770
1990    0.008226
1991    0.007414
1992    0.008122
1993    0.009268
1994    0.013745
1995    0.027324
1996    0.030073
1997    0.042236
1998    0.051087
1999    0.062438
2000    0.069852
2001    0.056273
2002    0.052753
2003    0.056794
2004    0.057002
2005    0.062792
2006    0.056377
2007    0.047984
2008    0.046464
2009    0.043673
2010    0.033260
2011    0.034030
2012    0.027553
2013    0.016786
2014    0.0138

Most of the cars remaining in the data are within about 20 years from the date of our dataset (2016).  This seems reasonable.

## Data Investigation - brand

Next, let's look further at the brand column and determine what information we can collect from it.

In [39]:
autos['brand'].value_counts()

volkswagen        10185
bmw                5283
opel               5194
mercedes_benz      4579
audi               4149
ford               3350
renault            2274
peugeot            1418
fiat               1242
seat                873
skoda               770
mazda               727
nissan              725
citroen             668
smart               668
toyota              599
sonstige_autos      523
hyundai             473
volvo               444
mini                415
mitsubishi          391
honda               377
kia                 341
alfa_romeo          318
porsche             293
suzuki              284
chevrolet           274
chrysler            176
dacia               123
daihatsu            123
jeep                108
subaru              105
land_rover           98
saab                 77
jaguar               76
trabant              75
daewoo               72
rover                65
lancia               52
lada                 29
Name: brand, dtype: int64

In [43]:
autos['brand'].value_counts(normalize=True).head(20)

volkswagen        0.212117
bmw               0.110026
opel              0.108172
mercedes_benz     0.095364
audi              0.086409
ford              0.069768
renault           0.047359
peugeot           0.029532
fiat              0.025866
seat              0.018181
skoda             0.016036
mazda             0.015141
nissan            0.015099
citroen           0.013912
smart             0.013912
toyota            0.012475
sonstige_autos    0.010892
hyundai           0.009851
volvo             0.009247
mini              0.008643
Name: brand, dtype: float64

I'm going to work with the top 20 brands, that is, those brands with at least 415 rows in the data. This captures all of the common brands; each remaining brand accounts for less than 0.86% of the listings. Then we can evaluate the prices for these brands.

The most common brands are mostly German brands (Volkswagen, BMW, Opel, Mercedes Benz, Audi), which makes sense given that we are looking at auto listings in Germany. There are also a number of other companies - from Europe, Asia, and the US - in this grouping.

In [50]:
# determine average price by brand
top_20_brands = autos['brand'].value_counts().head(20)
brand_price = {}
for b in top_20_brands.index:
    mean_price = autos.loc[autos['brand'] == b, 'price'].mean()
    brand_price[b] = round(mean_price,2)
brand_price

{'audi': 9093.65,
 'bmw': 8102.54,
 'citroen': 3699.94,
 'fiat': 2711.8,
 'ford': 3652.1,
 'hyundai': 5308.54,
 'mazda': 4010.77,
 'mercedes_benz': 8485.24,
 'mini': 10460.01,
 'nissan': 4664.89,
 'opel': 2876.72,
 'peugeot': 3039.47,
 'renault': 2395.42,
 'seat': 4296.49,
 'skoda': 6334.92,
 'smart': 3542.71,
 'sonstige_autos': 10805.08,
 'toyota': 5115.33,
 'volkswagen': 5231.08,
 'volvo': 4757.11}

There is a decent spread of average prices. On the low side, we see Renault, Fiat, and Opel. On the high side, we see Audi, Mini, and `sonstige_autos` (German for "other cars", a catch-all category). The German brands identified previously span the range, serving the lower (Opel), middle (Volkswagen), and higher ends of the market (Audi, BMW, and Mercedes Benz).
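The per-brand averages above were built with an explicit loop and a dictionary. As an alternative sketch, `groupby` can produce the same kind of table in one expression; the frame below is made up, and only the column names `brand` and `price` match the notebook.

```python
import pandas as pd

# Made-up listings; only the column names match the notebook
df = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel', 'fiat'],
    'price': [9000.0, 9200.0, 2800.0, 3000.0, 2700.0],
})

# The two most frequent brands in this toy data
top_brands = df['brand'].value_counts().head(2).index

# Mean price per brand, restricted to the top brands, highest first
mean_price = (df[df['brand'].isin(top_brands)]
              .groupby('brand')['price']
              .mean()
              .round(2)
              .sort_values(ascending=False))
print(mean_price)
```

Sorting the result makes the low and high ends of the market easier to read off than a dictionary.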

## Data Investigation - brand, price, and mileage

Let's take this one step further and look at the average price and average mileage for the top nine brands.  Is there any relationship between these?

In [53]:
# create dictionaries for average prices and average mileage
top_nine_brands = autos['brand'].value_counts().head(9)

brand_price_nine = {}
for b in top_nine_brands.index:
    mean_price = autos.loc[autos['brand'] == b, 'price'].mean()
    brand_price_nine[b] = round(mean_price,2)

brand_mileage_nine = {}
for b in top_nine_brands.index:
    mean_mileage = autos.loc[autos['brand'] == b, 'odometer_km'].mean()
    brand_mileage_nine[b] = round(mean_mileage,2)

In [54]:
# convert dictionaries to series
brand_avg_prices = pd.Series(brand_price_nine)
brand_avg_mileage = pd.Series(brand_mileage_nine)

In [55]:
brand_avgs = pd.DataFrame(brand_avg_prices, columns=['avg_prices'])

In [59]:
brand_avgs.assign(avg_mileage = brand_avg_mileage)

Unnamed: 0,avg_prices,avg_mileage
audi,9093.65,129287.78
bmw,8102.54,132431.38
fiat,2711.8,116553.95
ford,3652.1,124068.66
mercedes_benz,8485.24,130856.08
opel,2876.72,129223.14
peugeot,3039.47,127136.81
renault,2395.42,128183.82
volkswagen,5231.08,128724.1


Even though the average prices vary significantly (from about 2,400 to about 9,100), the average mileage on these vehicles is relatively consistent and on the high side, ranging from about 116,600 km to about 132,400 km.
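To put a number on the relationship between average price and average mileage, one option is to build both averages with a single `groupby` and compute their correlation. This is a sketch on made-up rows, so the resulting coefficient says nothing about the real data; the names `avg_price` and `avg_mileage` are illustrative.

```python
import pandas as pd

# Made-up listings for three brands
df = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'fiat', 'fiat', 'opel', 'opel'],
    'price': [8000.0, 8200.0, 2600.0, 2800.0, 2800.0, 3000.0],
    'odometer_km': [150000.0, 125000.0, 100000.0, 150000.0, 125000.0, 135000.0],
})

# Average price and average mileage per brand in one step
brand_avgs = (df.groupby('brand')
                .agg(avg_price=('price', 'mean'),
                     avg_mileage=('odometer_km', 'mean'))
                .round(2))

# Pearson correlation between the two per-brand averages
corr = brand_avgs['avg_price'].corr(brand_avgs['avg_mileage'])
print(brand_avgs)
print(corr)
```

With only nine brands in the real comparison, any such coefficient should be read with caution.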