# Cleaning & Analysis of cars from eBay Kleinanzeigen

**Note:** eBay Kleinanzeigen is the classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user [orgesleka](https://www.kaggle.com/orgesleka).

The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

## Dataset Alterations

* This is only a sample of 50,000 data points from the full dataset.
* This version was 'dirtied' to simulate a freshly scraped dataset.

## Column Header Info

* **dateCrawled**  - When this ad was first crawled. All field-values are taken from this date.
* **name**  - Name of the car.
* **seller**  - Whether the seller is private or a dealer.
* **offerType**  - The type of listing.
* **price**  - The price on the ad to sell the car.
* **abtest**  - Whether the listing is included in an A/B test.
* **vehicleType**  - The vehicle type.
* **yearOfRegistration**  - The year in which the car was first registered.
* **gearbox**  - The transmission type.
* **powerPS**  - The power of the car in PS.
* **model**  - The car model name.
* **kilometer**  - How many kilometers the car has driven.
* **monthOfRegistration** - The month in which the car was first registered.
* **fuelType**  - What type of fuel the car uses.
* **brand**  - The brand of the car.
* **notRepairedDamage**  - Whether the car has damage that has not yet been repaired.
* **dateCreated**  - The date on which the eBay listing was created.
* **nrOfPictures**  - The number of pictures in the ad.
* **postalCode**  - The postal code for the location of the vehicle.
* **lastSeenOnline**  - When the crawler saw this ad last online.

In [191]:
import numpy as np
import pandas as pd

In [192]:
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [193]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [194]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [195]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Initial Data Observations
* price is in USD
* price is stored as text
* odometer is in km
* odometer is stored as text
* the date columns (dateCrawled, dateCreated, lastSeen) share the same format: YYYY-MM-DD HH:MM:SS
* yearOfRegistration & monthOfRegistration are stored as plain integers

* there are null values in a few of the columns
* column names use camelCase

In [196]:
# changing the camel case columns to snake_case
case_map = {
    'dateCrawled':'date_crawled',
    'offerType':'offer_type',
    'vehicleType':'vehicle_type',
    'yearOfRegistration':'registration_year',
    'powerPS':'power_ps',
    'monthOfRegistration':'registration_month',
    'fuelType':'fuel_type',
    'notRepairedDamage':'unrepaired_damage',
    'dateCreated':'ad_created',
    'nrOfPictures':'photo_count',
    'postalCode':'postal_code',
    'lastSeen':'last_seen'
}
autos.rename(case_map, axis=1, inplace=True)
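
The mapping above is written by hand, which also allows renames such as `registration_year` that are not a literal case conversion. As a rough alternative, a small regex helper can convert camelCase names mechanically. This is only a sketch: `camel_to_snake` is a hypothetical helper name, and it yields `year_of_registration` rather than the hand-picked `registration_year`.

```python
import re

def camel_to_snake(name):
    # Insert "_" at each boundary between a lowercase letter/digit and a capital,
    # then lowercase everything: 'dateCrawled' -> 'date_Crawled' -> 'date_crawled'
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the original column names, used here purely for illustration
columns = ['dateCrawled', 'yearOfRegistration', 'powerPS', 'abtest']
renamed = [camel_to_snake(c) for c in columns]
print(renamed)
```

On a real DataFrame the helper could be passed straight to `rename`, e.g. `autos.rename(columns=camel_to_snake)`, since `rename` accepts a function as the mapper.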

In [197]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,photo_count,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [198]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,photo_count,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


## Columns with mostly Same Data
(to be dropped)
* seller
* offer_type
* abtest
* gearbox
* unrepaired_damage
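
One way to check how dominated a column is by a single value is to compute the share of its most frequent value. The sketch below uses a made-up toy frame (not the real data), and `top_value_share` is a hypothetical helper name.

```python
import pandas as pd

# Toy data for illustration only: one near-constant column, one balanced column
toy = pd.DataFrame({
    'seller': ['privat', 'privat', 'privat', 'gewerblich'],
    'abtest': ['test', 'control', 'test', 'control'],
})

def top_value_share(df):
    # For each column: fraction of non-null rows taken by its most frequent value.
    # Assumes every column has at least one non-null value.
    return df.apply(lambda col: col.value_counts(normalize=True).iloc[0])

shares = top_value_share(toy)
print(shares)
```

A share close to 1.0 means the column carries almost no information, while a share near 0.5 for a two-valued column means it is roughly balanced.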

## Columns that need more investigation
* investigate the null values in:
    * vehicle_type
    * gearbox
    * model
    * fuel_type
    * unrepaired_damage
* 32,424 of the 50,000 listings sit at 150,000km on the odometer (unusual)
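
To decide which null-containing columns deserve attention first, it can help to rank them by the fraction of missing values. A minimal sketch on made-up data (the toy frame below is an assumption, not the real dataset):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'vehicle_type': ['bus', None, 'kombi', None],
    'gearbox': ['manuell', 'automatik', None, 'manuell'],
    'brand': ['peugeot', 'bmw', 'ford', 'audi'],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```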

## Columns that need cleaning
* price column needs conversion to integer
* odometer column needs conversion to integer

In [199]:
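# converting the price column to integer
# regex=False treats '$' as a literal character rather than the end-of-string anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'] = autos['price'].astype(int)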

In [200]:
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
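
Since both columns are cleaned the same way, the steps could be folded into one helper that strips every non-digit character before casting. This is a sketch under the assumption that the values contain no decimal points or minus signs (both would be silently dropped); `to_int_column` is a hypothetical name.

```python
import pandas as pd

def to_int_column(series):
    # Remove every character that is not a digit, then cast to int.
    # Caveat: this would also strip '-' and '.', so it only suits non-negative whole numbers.
    return series.str.replace(r'\D', '', regex=True).astype(int)

# Toy values in the same style as the raw columns
prices = pd.Series(['$5,000', '$8,500', '$0'])
odometer = pd.Series(['150,000km', '70,000km'])

clean_prices = to_int_column(prices)
clean_odometer = to_int_column(odometer)
print(clean_prices.tolist(), clean_odometer.tolist())
```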

In [201]:
autos.dtypes

date_crawled          object
name                  object
seller                object
offer_type            object
price                  int64
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer_km            int64
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
photo_count            int64
postal_code            int64
last_seen             object
dtype: object

In [202]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,photo_count,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Analysing odometer_km

In [203]:
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [204]:
autos['odometer_km'].unique().shape

(13,)

In [205]:
freq = autos['odometer_km'].value_counts()
freq

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In [206]:
freq.sort_index(ascending=True)

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

## Analysing Price

In [207]:
autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [208]:
autos['price'].unique().shape

(2357,)

In [209]:
freq = autos['price'].value_counts()
freq.sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

There is an unusually large number of extremely low-priced listings, especially at 0 (1,421 listings) and 1 (156 listings).

In [210]:
freq.sort_index(ascending=False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

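The values jump sharply above $350,000: the next prices are 999,990 and higher, and several of them (e.g. 12,345,678 or 11,111,111) look like placeholder entries.

I'm constraining the price column to the range 2 to 340,000. This also drops the two listings at 345,000 and 350,000.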

In [211]:
autos = autos[autos['price'].between(2,340000)]
freq = autos['price'].value_counts()
freq.sort_index(ascending=True).head(5)

2    3
3    1
5    2
8    1
9    1
Name: price, dtype: int64

In [212]:
freq.sort_index(ascending=False).head(5)

299000    1
295000    1
265000    1
259000    1
250000    1
Name: price, dtype: int64
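
`Series.between` is inclusive on both ends by default, so a listing priced at exactly 2 or exactly 340,000 is kept. A small sketch on made-up prices shows this boundary behaviour and one way to count how many rows a filter removes:

```python
import pandas as pd

# Toy prices for illustration, including values on and just outside the bounds
prices = pd.Series([0, 1, 2, 4350, 340000, 345000, 99999999])

mask = prices.between(2, 340000)   # inclusive on both ends by default
kept = prices[mask]
dropped = int((~mask).sum())       # number of rows outside the range

print(kept.tolist(), dropped)
```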

In [213]:
autos[['date_crawled','ad_created','last_seen']].head(5)

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [214]:
just_date_crawled = autos['date_crawled'].str[:10]
print(just_date_crawled)

0        2016-03-26
1        2016-04-04
2        2016-03-26
3        2016-03-12
4        2016-04-01
            ...    
49995    2016-03-27
49996    2016-03-28
49997    2016-04-02
49998    2016-03-08
49999    2016-03-14
Name: date_crawled, Length: 48407, dtype: object


In [215]:
(just_date_crawled
 .value_counts(normalize=True, dropna=False)
 .sort_index()
)

2016-03-05    0.025368
2016-03-06    0.014068
2016-03-07    0.036049
2016-03-08    0.033280
2016-03-09    0.033053
2016-03-10    0.032206
2016-03-11    0.032599
2016-03-12    0.036957
2016-03-13    0.015659
2016-03-14    0.036627
2016-03-15    0.034272
2016-03-16    0.029521
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034747
2016-03-20    0.037804
2016-03-21    0.037309
2016-03-22    0.032929
2016-03-23    0.032268
2016-03-24    0.029397
2016-03-25    0.031566
2016-03-26    0.032247
2016-03-27    0.031111
2016-03-28    0.034850
2016-03-29    0.034127
2016-03-30    0.033714
2016-03-31    0.031814
2016-04-01    0.033735
2016-04-02    0.035491
2016-04-03    0.038589
2016-04-04    0.036482
2016-04-05    0.013077
2016-04-06    0.003161
2016-04-07    0.001384
Name: date_crawled, dtype: float64

There is an unusual drop-off in the number of listings crawled on 2016-04-06 and 2016-04-07 (about 0.3% and 0.1% of listings, against roughly 3% on a typical day).
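
The daily distribution above is built by slicing the first 10 characters of each timestamp string. An alternative sketch parses the strings into real datetimes first, which also validates the format. The sample values are taken from the first rows shown earlier; the variable names are hypothetical.

```python
import pandas as pd

crawled = pd.Series([
    '2016-03-26 17:47:46',
    '2016-03-26 18:57:24',
    '2016-04-04 13:38:56',
    '2016-03-12 16:58:10',
])

# Parse with an explicit format so malformed strings raise instead of being guessed
parsed = pd.to_datetime(crawled, format='%Y-%m-%d %H:%M:%S')

# normalize() sets the time part to midnight, leaving one value per calendar day
daily_share = parsed.dt.normalize().value_counts(normalize=True).sort_index()
print(daily_share)
```

With datetime values, later steps such as resampling by week or computing listing durations become straightforward.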

In [216]:
(autos['last_seen']
 .str[0:10]
 .value_counts(normalize=True, dropna=False)
 .sort_index()
)

2016-03-05    0.001074
2016-03-06    0.004338
2016-03-07    0.005412
2016-03-08    0.007375
2016-03-09    0.009627
2016-03-10    0.010618
2016-03-11    0.012374
2016-03-12    0.023798
2016-03-13    0.008862
2016-03-14    0.012622
2016-03-15    0.015865
2016-03-16    0.016444
2016-03-17    0.028074
2016-03-18    0.007334
2016-03-19    0.015824
2016-03-20    0.020638
2016-03-21    0.020617
2016-03-22    0.021381
2016-03-23    0.018592
2016-03-24    0.019749
2016-03-25    0.019191
2016-03-26    0.016816
2016-03-27    0.015597
2016-03-28    0.020885
2016-03-29    0.022331
2016-03-30    0.024748
2016-03-31    0.023840
2016-04-01    0.022869
2016-04-02    0.024852
2016-04-03    0.025203
2016-04-04    0.024501
2016-04-05    0.124941
2016-04-06    0.221600
2016-04-07    0.132006
Name: last_seen, dtype: float64

The share of last_seen dates ramps up slowly at first, with an anomalous drop-off on 2016-03-13. There is also a large spike in the last 3 days (2016-04-05 to 2016-04-07). Could this be representative of the listings that were still active when the crawl ended?
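
To put a number on the final spike, the three last_seen shares printed above can simply be added up. The values below are copied from that output:

```python
# Shares of last_seen dates for the final three days, copied from the output above
last_three_days = {
    '2016-04-05': 0.124941,
    '2016-04-06': 0.221600,
    '2016-04-07': 0.132006,
}

share = sum(last_three_days.values())
print(round(share, 3))
```

Nearly half of all listings were last seen in those three days.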

In [217]:
(autos['ad_created']
 .str[0:10]
 .value_counts(normalize=True, dropna=False)
 .sort_index()
)

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038837
2016-04-04    0.036854
2016-04-05    0.011796
2016-04-06    0.003243
2016-04-07    0.001239
Name: ad_created, Length: 76, dtype: float64

In [218]:
autos['registration_year'].describe()

count    48407.000000
mean      2004.773938
std         88.785092
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

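registration_year likely reflects the age of the car. The minimum (1000) and maximum (9999) values are too extreme to be real.

For the maximum, I've decided to remove entries with a registration_year later than the year of the newest last_seen date (2016), since a car cannot be first registered after its listing was last seen. Registration in that same year IS possible, so 2016 itself is kept.

For the minimum, I'm removing values below 1900 as an arbitrary cutoff.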

In [219]:
autos = autos[autos['registration_year'].between(1900,2016)]
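
The upper bound of 2016 is hard-coded in the filter above. As a sketch, it could instead be derived from the data by taking the year of the newest `last_seen` value. The toy frame below is made up for illustration, and it assumes `last_seen` is still a string starting with a four-digit year.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'registration_year': [1000, 1997, 2016, 2017, 9999],
    'last_seen': [
        '2016-04-06 06:45:54',
        '2016-04-06 14:45:08',
        '2016-04-07 06:17:27',
        '2016-03-15 03:16:28',
        '2016-04-01 14:38:50',
    ],
})

# The first four characters of the timestamp string are the year
newest_year = int(toy['last_seen'].str[:4].max())

# Keep years from the arbitrary 1900 cutoff up to and including the newest crawl year
valid = toy[toy['registration_year'].between(1900, newest_year)]
print(newest_year, valid['registration_year'].tolist())
```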

In [220]:
autos['registration_year'].value_counts(normalize=True)

2000    0.067227
2005    0.062907
1999    0.062026
2004    0.058028
2003    0.057964
          ...   
1939    0.000021
1948    0.000021
1931    0.000021
1927    0.000021
1952    0.000021
Name: registration_year, Length: 78, dtype: float64

## Average price of the most popular Brands

In [221]:
# getting the most popular brands
common_brands = autos['brand'].value_counts(normalize=True)
common_brands = common_brands[common_brands>.05]
print(common_brands)

volkswagen       0.211352
bmw              0.109953
opel             0.107374
mercedes_benz    0.096456
audi             0.086613
ford             0.069978
Name: brand, dtype: float64


In [222]:
# storing each brand's mean price in a dict keyed by brand name
brand_prices = {}
for brand in common_brands.index:
    brand_listings = autos.loc[autos['brand']==brand,'price']
    brand_prices[brand] = int(brand_listings.mean())

print(brand_prices)

{'volkswagen': 5417, 'bmw': 8367, 'opel': 2990, 'mercedes_benz': 8657, 'audi': 9362, 'ford': 3757}


In [223]:
brand_mileage = {}
for brand in common_brands.index:
    brand_odometers = autos.loc[autos['brand']==brand,'odometer_km']
    brand_mileage[brand] = int(brand_odometers.mean())

print(brand_mileage)

{'volkswagen': 128708, 'bmw': 132553, 'opel': 129322, 'mercedes_benz': 130838, 'audi': 129208, 'ford': 124210}


In [224]:
brand_mileage_series = pd.Series(data=brand_mileage)
print(brand_mileage_series)

volkswagen       128708
bmw              132553
opel             129322
mercedes_benz    130838
audi             129208
ford             124210
dtype: int64


In [225]:
brand_price_series = pd.Series(data=brand_prices)
print(brand_price_series)

volkswagen       5417
bmw              8367
opel             2990
mercedes_benz    8657
audi             9362
ford             3757
dtype: int64


In [226]:
brand_df = pd.DataFrame(data=brand_price_series, columns=['mean_price'])
print(brand_df)

               mean_price
volkswagen           5417
bmw                  8367
opel                 2990
mercedes_benz        8657
audi                 9362
ford                 3757


In [230]:
brand_df['mean_mileage'] = brand_mileage_series
brand_df.sort_values(by='mean_price')

Unnamed: 0,mean_price,mean_mileage
opel,2990,129322
ford,3757,124210
volkswagen,5417,128708
bmw,8367,132553
mercedes_benz,8657,130838
audi,9362,129208


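There seems to be little relationship between mean mileage and mean price: mean mileage stays within a narrow band across the six brands (124,210 to 132,553 km), while mean price ranges from 2,990 to 9,362.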