#### eBay Kleinanzeigen

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle.


---
The data dictionary provided with the data is as follows:

* `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
* `name` - Name of the car.
* `seller` - Whether the seller is private or a dealer.
* `offerType` - The type of listing.
* `price` - The price on the ad to sell the car.
* `abtest` - Whether the listing is included in an A/B test.
* `vehicleType` - The vehicle type.
* `yearOfRegistration` - The year in which the car was first registered.
* `gearbox` - The transmission type.
* `powerPS` - The power of the car in PS.
* `model` - The car model name.
* `odometer` - How many kilometers the car has driven.
* `monthOfRegistration` - The month in which the car was first registered.
* `fuelType` - What type of fuel the car uses.
* `brand` - The brand of the car.
* `notRepairedDamage` - If the car has a damage which is not yet repaired.
* `dateCreated` - The date on which the eBay listing was created.
* `nrOfPictures` - The number of pictures in the ad.
* `postalCode` - The postal code for the location of the vehicle.
* `lastSeen` - When the crawler saw this ad last online.

---
**The aim of this project is to clean the data and analyze the included used car listings.**

We start by importing the pandas and NumPy libraries and reading the file into the program.

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv('autos.csv', encoding = 'Latin-1')

In [2]:
autos.info()  # info() prints its report directly and returns None
print('\n')
print(autos.head())
print('\n')
print(autos.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

Our dataframe has 50,000 rows and 20 columns, with a mixture of string and numeric data. The `name` column holds mixed free text, and most of the category labels are in German, although they are easy to understand.

Some columns have null values, but none have more than ~20% null values.
The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
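As a quick check of the "no more than ~20% null" claim, the sketch below recomputes the null share from the non-null counts printed by `autos.info()` above. The dictionary is typed in by hand from that output, and the variable names are only illustrative.

```python
# Non-null counts copied from the autos.info() output above (50,000 rows in total).
total_rows = 50000
non_null_counts = {
    "vehicleType": 44905,
    "gearbox": 47320,
    "model": 47242,
    "fuelType": 45518,
    "notRepairedDamage": 40171,
}

# Share of missing values per column, as a percentage.
null_share = {
    column: round(100 * (total_rows - count) / total_rows, 2)
    for column, count in non_null_counts.items()
}

# Print the columns from most to least incomplete.
for column, share in sorted(null_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{column}: {share}% null")
```

On the loaded dataframe itself, `autos.isnull().mean() * 100` should give the same percentages without typing the counts by hand.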

In [3]:
import re

print(autos.columns)
print('\n')

# Columns that get a hand-picked name instead of a mechanical conversion.
renamed = {
    'yearOfRegistration': 'registration_year',
    'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
}

new_col = []
for c in autos.columns:
    if c in renamed:
        c = renamed[c]
    else:
        # Insert '_' before each capital letter, then lowercase
        # (so powerPS becomes power_p_s).
        c = re.sub(r'(?<!^)(?=[A-Z])', '_', c).lower()
    new_col.append(c)

autos.columns = new_col

autos.head()

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


We changed the column labels to snake_case to align with the preferred Python style.
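To see what the regular expression in the renaming cell does on its own, here is a small standalone sketch. `camel_to_snake` wraps the same pattern used above; `camel_to_snake_keep_acronyms` is a hypothetical variant, not part of the notebook, that avoids splitting runs of capitals such as `PS`.

```python
import re

def camel_to_snake(name):
    """Insert '_' before every capital letter (except at the start), then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

def camel_to_snake_keep_acronyms(name):
    """Hypothetical variant: split only where a lowercase letter or digit precedes a capital."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

for label in ['dateCrawled', 'nrOfPictures', 'powerPS']:
    print(label, '->', camel_to_snake(label), '|', camel_to_snake_keep_acronyms(label))
```

The rest of this notebook keeps the `power_p_s` label produced by the first function, so the variant is only worth adopting if the later cells are updated to match.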

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
- Examples of numeric data stored as text which can be cleaned and converted.
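The two checks listed above can also be done programmatically. The following is a rough sketch on a made-up four-row frame, not on the real data; the helper names `dominant_share` and `looks_numeric`, the 0.75 threshold, and the set of characters stripped are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for autos; the values are illustrative only.
sample = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "gewerblich"],
    "price": ["$5,000", "$8,500", "$0", "$1,350"],
    "brand": ["bmw", "ford", "audi", "bmw"],
})

def dominant_share(series):
    """Fraction of rows taken by the most frequent value (NaN counted as a value)."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

def looks_numeric(series):
    """True if every non-null entry parses as a number once '$', ',' and 'km' are stripped."""
    cleaned = series.dropna().astype(str).str.replace(r"[$,]|km", "", regex=True)
    return bool(pd.to_numeric(cleaned, errors="coerce").notna().all())

# Columns where one value covers at least 75% of the rows.
near_constant = [c for c in sample.columns if dominant_share(sample[c]) >= 0.75]
# Text columns that are really numbers in disguise.
numeric_as_text = [c for c in sample.columns if looks_numeric(sample[c])]

print("near-constant columns:", near_constant)
print("numeric stored as text:", numeric_as_text)
```

On the full dataset a much stricter threshold (for example 0.99) would probably be more appropriate, since several legitimate columns have a dominant category.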

In [4]:
# Without include='all', describe() summarizes only the numeric columns
autos.describe(include='all')


Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-19 17:36:18,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [5]:
def data_explorer(col_name):
    print ('Looking at:',col_name)
    print(autos[col_name].head())
    print('-------------------')
    print(autos[col_name].value_counts())
    print('\n')
    
data_explorer("date_crawled")
data_explorer("name")
data_explorer("seller")
data_explorer("offer_type")
data_explorer("price")
data_explorer("abtest")
data_explorer("vehicle_type")
data_explorer("registration_year")
data_explorer("gearbox")
data_explorer("power_p_s")
data_explorer("model")
data_explorer("odometer")
data_explorer("registration_month")
data_explorer("fuel_type")
data_explorer("unrepaired_damage")
data_explorer("ad_created")
data_explorer("nr_of_pictures")
data_explorer("postal_code")
data_explorer("last_seen")

Looking at: date_crawled
0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
2    2016-03-26 18:57:24
3    2016-03-12 16:58:10
4    2016-04-01 14:38:50
Name: date_crawled, dtype: object
-------------------
2016-03-19 17:36:18    3
2016-03-05 16:57:05    3
2016-04-02 11:37:04    3
2016-03-14 20:50:02    3
2016-03-30 17:37:35    3
2016-03-08 10:40:35    3
2016-03-22 09:51:06    3
2016-03-21 20:37:19    3
2016-03-30 19:48:02    3
2016-03-16 21:50:53    3
2016-03-23 19:38:20    3
2016-04-02 15:49:30    3
2016-03-29 23:42:13    3
2016-04-04 16:40:33    3
2016-03-10 15:36:24    3
2016-03-23 18:39:34    3
2016-03-21 16:37:21    3
2016-03-12 16:06:22    3
2016-03-25 19:57:10    3
2016-03-11 22:38:16    3
2016-03-09 11:54:38    3
2016-03-27 22:55:05    3
2016-03-22 22:38:19    2
2016-03-10 00:49:19    2
2016-03-07 13:38:18    2
2016-03-06 08:36:22    2
2016-04-01 14:49:35    2
2016-03-21 14:47:45    2
2016-04-03 08:36:19    2
2016-03-12 20:50:40    2
                      ..
2016-03-19 22:50:20  

* date_crawled: many unique values, stored as strings.
* name: many unique values, stored as strings.
* seller: two unique values, but all rows except one share the same value, so it can be discarded.
* offer_type: two unique values, but all rows except one share the same value, so it can be discarded.
* price: many unique values; numbers stored as strings.
* abtest: only two unique values, split roughly 50-50; not sure yet whether it is needed.
* vehicle_type: string, required.
* registration_year: many unique values, int64, with some odd values that need to be cleaned out.
* gearbox: only two unique values, split roughly 80-20 in favour of manual; not sure yet whether it is needed.
* power_p_s: many unique values, int64.
* model: string, many unique values.
* odometer: numbers stored as strings, with a `km` suffix.
* registration_month: a number out of 12, stored as int64.
* fuel_type: 7 unique types, string.
* unrepaired_damage: only 2 values; not sure yet whether to keep it.
* ad_created: dates stored as strings, many unique values.

For `price` and `odometer`, we will:
* Remove any non-numeric characters.
* Convert the column to a numeric dtype.
* Use `DataFrame.rename()` to rename the `odometer` column to `odometer_km`.
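The three steps above can be sketched on a tiny made-up frame as follows. The helper `to_number` is a hypothetical name, and it assumes the values are non-negative whole numbers, because it drops every character that is not a digit (including any decimal point or minus sign).

```python
import pandas as pd

# Toy frame in the same shape as the raw price and odometer columns.
sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

def to_number(series):
    """Drop every character that is not a digit, then convert to a numeric dtype."""
    return pd.to_numeric(series.str.replace(r"[^0-9]", "", regex=True))

sample["price"] = to_number(sample["price"])
sample["odometer"] = to_number(sample["odometer"])

# rename() returns a new frame unless inplace=True is passed.
sample = sample.rename(columns={"odometer": "odometer_km"})

print(sample.dtypes)
print(sample)
```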

In [6]:

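# Strip the currency symbol and thousands separators, then convert to float.
# regex=False makes sure '$' is treated as a literal character, not a regex anchor.
autos["price"] = autos["price"].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

# Strip the unit and thousands separators, then convert to float.
autos['odometer'] = autos["odometer"].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(float)

autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)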

We now explore `odometer_km` and `price` to find their unique, minimum and maximum values.

In [20]:
## for odometer_km

print("unique values?")
print(autos["odometer_km"].unique().shape) #to see how many unique values
print('\n')
print("min/max/median/mean etc")    
print(autos["odometer_km"].describe()) #to view max and min and stats
print('\n')

print ('Looking at the first five of the value counts')
print(autos["odometer_km"].value_counts().head())
print('\n')

print ('Looking at the odometers highest and lowest value with their counts')
print(autos["odometer_km"].value_counts().sort_index(ascending=True))

unique values?
(13,)


min/max/median/mean etc
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


Looking at the first five of the value counts
150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
Name: odometer_km, dtype: int64


Looking at the odometers highest and lowest value with their counts
5000.0        967
10000.0       264
20000.0       784
30000.0       789
40000.0       819
50000.0      1027
60000.0      1164
70000.0      1230
80000.0      1436
90000.0      1757
100000.0     2169
125000.0     5170
150000.0    32424
Name: odometer_km, dtype: int64


There are 13 unique odometer values, but the majority of listings sit at 125,000 km or 150,000 km (5,170 and 32,424 of the 50,000 rows), so we are going to use those only.

In [23]:
#For Price

print("unique values?")
print(autos["price"].unique().shape) #to see how many unique values
print('\n')
print("min/max/median/mean etc")    
print(autos["price"].describe()) #to view max and min and stats
print('\n')

print('Looking at the first 100 of the value counts')
print(autos["price"].value_counts().head(100))
print('\n')

print('Looking at the price highest and lowest value with their counts')
print(autos["price"].value_counts().sort_index(ascending=True))

unique values?
(2357,)


min/max/median/mean etc
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


Looking at the first 100 of the value counts
0.0        1421
500.0       781
1500.0      734
2500.0      643
1200.0      639
1000.0      639
600.0       531
800.0       498
3500.0      498
2000.0      460
999.0       434
750.0       433
900.0       420
650.0       419
850.0       410
700.0       395
4500.0      394
300.0       384
2200.0      382
950.0       379
1100.0      376
1300.0      371
3000.0      365
550.0       356
1800.0      355
5500.0      340
1250.0      335
350.0       335
1600.0      327
1999.0      322
           ... 
4999.0      174
2950.0      173
2750.0      169
6900.0      165
7000.0      165
8000.0      161
1499.0      157
1.0         156
7900.0      156
5200.0      156
3700.0      155
1550.0      150
990.0      

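There are 2,357 unique price values in this column. The most common single value is 0 (1,421 listings), and the counts then fall off slowly across the other prices. The maximum of 100,000,000 is clearly not a real asking price. To remove the outliers, we drop everything above 10,000 dollars; listings priced at 0 are kept by the filter below.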

In [29]:
##removing outliers for odometer and price 

autos = autos[autos["odometer_km"].between(124999,150001)]

autos = autos[autos["price"].between(0,10000)]

autos.describe(include = 'all')


Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,34307,34307,34307,34307,34307.0,34307,30425,34307.0,32457,34307.0,32394,34307.0,34307.0,30907,34307,26583,34307,34307.0,34307.0,34307
unique,33466,26995,2,2,,2,8,,2,,231,,,7,40,2,59,,,29666
top,2016-03-05 16:57:05,BMW_316i,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-04 00:00:00,,,2016-04-07 06:17:27
freq,3,72,34306,34306,,17679,9120,,26954,,3047,,,20625,7835,22451,1310,,,6
mean,,,,,2764.635876,,,2002.253942,,106.201067,,146896.405981,5.653511,,,,,0.0,50063.285743,
std,,,,,2455.170488,,,38.330057,,205.570936,,8243.757369,3.761821,,,,,0.0,25556.179498,
min,,,,,0.0,,,1910.0,,0.0,,125000.0,0.0,,,,,0.0,1067.0,
25%,,,,,899.0,,,1998.0,,65.0,,150000.0,3.0,,,,,0.0,29481.5,
50%,,,,,1950.0,,,2002.0,,102.0,,150000.0,6.0,,,,,0.0,48529.0,
75%,,,,,3999.0,,,2005.0,,140.0,,150000.0,9.0,,,,,0.0,69242.0,


We now have 34,307 rows, and we can confirm that the outlier values for price and odometer have been removed. About 15,700 rows were dropped (50,000 - 34,307 = 15,693).
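The filtering step and the row-count check can be sketched on made-up data like this. The five rows are invented for illustration; only the bounds (price up to 10,000, odometer from 125,000 to 150,000 km) come from the notebook.

```python
import pandas as pd

# Toy frame: two rows should survive, three should be removed.
sample = pd.DataFrame({
    "price": [0.0, 1350.0, 8990.0, 12500.0, 99999999.0],
    "odometer_km": [150000.0, 125000.0, 70000.0, 150000.0, 150000.0],
})

# between() is inclusive on both ends by default.
keep = sample["price"].between(0, 10000) & sample["odometer_km"].between(125000, 150000)
filtered = sample[keep]

rows_before = len(sample)
rows_after = len(filtered)
print(f"kept {rows_after} of {rows_before} rows, removed {rows_before - rows_after}")

# The same arithmetic for the real dataset, using the counts reported above.
print(50000 - 34307)
```

Building one combined mask before filtering makes it easy to inspect the rejected rows with `sample[~keep]` before they are discarded.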

In [40]:


#for date_crawled
date_crawled_stats = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)*100
print ('date_crawled_stats')
print(date_crawled_stats.sort_index())
print('\n')

#for ad_created
ad_created_stats = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)*100
print ('ad_created_stats')
print(ad_created_stats.sort_index())
print('\n')

#for last_seen
last_seen_stats = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)*100
print ('last_seen_stats')
print(last_seen_stats.sort_index())


date_crawled_stats
2016-03-05    2.468884
2016-03-06    1.399131
2016-03-07    3.643571
2016-03-08    3.471595
2016-03-09    3.401638
2016-03-10    3.247151
2016-03-11    3.215087
2016-03-12    3.669805
2016-03-13    1.495322
2016-03-14    3.757251
2016-03-15    3.369575
2016-03-16    3.019792
2016-03-17    3.203428
2016-03-18    1.332090
2016-03-19    3.285044
2016-03-20    3.774740
2016-03-21    3.774740
2016-03-22    3.293788
2016-03-23    3.285044
2016-03-24    2.847815
2016-03-25    3.276299
2016-03-26    3.232576
2016-03-27    3.075174
2016-03-28    3.454106
2016-03-29    3.346256
2016-03-30    3.378319
2016-03-31    3.218002
2016-04-01    3.264640
2016-04-02    3.459935
2016-04-03    3.745591
2016-04-04    3.748506
2016-04-05    1.372898
2016-04-06    0.367272
2016-04-07    0.104935
Name: date_crawled, dtype: float64


ad_created_stats
2015-12-05    0.002915
2016-01-07    0.002915
2016-01-13    0.002915
2016-01-16    0.002915
2016-01-27    0.002915
2016-01-29    0.002915
2016-02

* date_crawled: just over a month (2016-03-05 to 2016-04-07), with a fairly uniform distribution.
* ad_created: a longer range starting in December 2015, with very low percentages in December and January and a roughly uniform distribution in March.
* last_seen: just over a month (the same range as date_crawled), but mainly concentrated on three days: 5, 6 and 7 April 2016.
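The notebook slices the first ten characters of each timestamp string to get the day. An alternative sketch, assuming the timestamps always follow the `YYYY-MM-DD HH:MM:SS` layout seen in the output, parses them with `pd.to_datetime` first, which would fail loudly on malformed values. The four timestamps are the first `date_crawled` values shown earlier.

```python
import pandas as pd

# First four date_crawled values from the head() output above.
stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
])

# Parse to real datetimes, then format back to a plain day string.
days = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S").dt.strftime("%Y-%m-%d")

# Percentage of rows per day, in chronological order.
share_by_day = (days.value_counts(normalize=True) * 100).sort_index()
print(share_by_day)
```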

In [41]:
autos['registration_year'].describe()

count    34307.000000
mean      2002.253942
std         38.330057
min       1910.000000
25%       1998.000000
50%       2002.000000
75%       2005.000000
max       9000.000000
Name: registration_year, dtype: float64

In [43]:
autos['registration_year'].value_counts().sort_index()

1910       1
1937       1
1953       1
1958       1
1959       1
1960       7
1961       1
1962       1
1963       2
1965       3
1966       4
1967       3
1968       5
1970      10
1971       2
1972       7
1973      13
1974       4
1975       5
1976       9
1977       9
1978      16
1979      15
1980      40
1981      10
1982      26
1983      28
1984      26
1985      56
1986      46
        ... 
1991     285
1992     319
1993     387
1994     577
1995    1137
1996    1296
1997    1853
1998    2203
1999    2722
2000    2954
2001    2388
2002    2212
2003    2347
2004    2285
2005    2335
2006    1892
2007    1368
2008    1000
2009     626
2010     296
2011     185
2012      66
2013      13
2014       3
2015       4
2016    1068
2017    1161
2018     375
2019       2
9000       1
Name: registration_year, Length: 64, dtype: int64

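I can see that the first count of 40 occurs at 1980, so I use roughly that as the cut-off; there are earlier cars, but they are too much of an outlier. The filter below keeps 1979 onward.

An interesting thing is that 2017 has over 1,000 cars listed, even though the ads were crawled in March and April 2016, so these registration years cannot be right. For now I leave the 2017 listings in, while the 2018, 2019 and 9000 entries are dropped. Hence the expression below: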

In [45]:
autos = autos[autos["registration_year"].between(1979,2017)]
autos["registration_year"].value_counts(normalize = True)*100

2000    8.733702
1999    8.047778
2001    7.060284
2003    6.939065
2005    6.903586
2004    6.755758
2002    6.539928
1998    6.513319
2006    5.593827
1997    5.478521
2007    4.044585
1996    3.831712
2017    3.432575
1995    3.361618
2016    3.157615
2008    2.956568
2009    1.850812
1994    1.705940
1993    1.144192
1992    0.943145
2010    0.875144
1990    0.872188
1991    0.842622
2011    0.546965
1989    0.408006
1988    0.301570
2012    0.195133
1987    0.174438
1985    0.165568
1986    0.136002
1980    0.118263
1983    0.082784
1984    0.076871
1982    0.076871
1979    0.044349
2013    0.038435
1981    0.029566
2015    0.011826
2014    0.008870
Name: registration_year, dtype: float64

We can see that the highest share of cars was registered in 2000, 1999, 2001 and 2003, and it tapers off after that.
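Because `Series.between` includes both endpoints by default, the exact bounds decide whether borderline years such as 1979 and 2017 stay in. The sketch below compares the bounds used in the notebook with a hypothetical stricter pair (1980 to 2016); the stricter pair is an assumption for illustration, not what the analysis above uses.

```python
import pandas as pd

# A few registration years taken from the value counts above.
years = pd.Series([1910, 1979, 1980, 2000, 2016, 2017, 2018, 9000])

# Bounds used in the notebook: both 1979 and 2017 are kept.
as_coded = years[years.between(1979, 2017)]

# Hypothetical stricter bounds: drop 1979 and anything after the 2016 crawl.
stricter = years[years.between(1980, 2016)]

print(as_coded.tolist())
print(stricter.tolist())
```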

In [51]:
autos["brand"].unique()

array(['peugeot', 'bmw', 'ford', 'chrysler', 'volkswagen', 'seat',
       'renault', 'mercedes_benz', 'audi', 'opel', 'mazda', 'mini',
       'toyota', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'fiat',
       'skoda', 'subaru', 'sonstige_autos', 'citroen', 'smart', 'kia',
       'porsche', 'hyundai', 'honda', 'daewoo', 'suzuki', 'dacia',
       'land_rover', 'chevrolet', 'jaguar', 'alfa_romeo', 'daihatsu',
       'rover', 'lancia', 'trabant', 'lada'], dtype=object)

In [58]:
print(autos["brand"].value_counts(normalize = True)*100) #%percentages
autos["brand"].value_counts() #normal count


volkswagen        22.827662
opel              12.497413
bmw               10.646601
mercedes_benz      9.147621
audi               8.083257
ford               7.225852
renault            5.419389
peugeot            3.204920
fiat               2.513083
seat               1.883334
mazda              1.658635
nissan             1.398457
citroen            1.368891
skoda              1.274281
toyota             1.138279
volvo              1.064364
smart              0.952015
mitsubishi         0.875144
honda              0.827839
alfa_romeo         0.762795
hyundai            0.742099
kia                0.606096
sonstige_autos     0.487834
suzuki             0.473051
chrysler           0.419833
mini               0.405050
chevrolet          0.345918
daihatsu           0.248352
subaru             0.227656
saab               0.209916
jeep               0.183307
rover              0.171481
daewoo             0.159655
dacia              0.136002
land_rover         0.112350
lancia             0

volkswagen        7721
opel              4227
bmw               3601
mercedes_benz     3094
audi              2734
ford              2444
renault           1833
peugeot           1084
fiat               850
seat               637
mazda              561
nissan             473
citroen            463
skoda              431
toyota             385
volvo              360
smart              322
mitsubishi         296
honda              280
alfa_romeo         258
hyundai            251
kia                205
sonstige_autos     165
suzuki             160
chrysler           142
mini               137
chevrolet          117
daihatsu            84
subaru              77
saab                71
jeep                62
rover               58
daewoo              54
dacia               46
land_rover          38
lancia              37
jaguar              36
porsche             19
lada                 5
trabant              5
Name: brand, dtype: int64

In [54]:
autos["brand"].value_counts().index  # brand names in descending order of listing count

Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'mazda', 'nissan', 'citroen', 'skoda',
       'toyota', 'volvo', 'smart', 'mitsubishi', 'honda', 'alfa_romeo',
       'hyundai', 'kia', 'sonstige_autos', 'suzuki', 'chrysler', 'mini',
       'chevrolet', 'daihatsu', 'subaru', 'saab', 'jeep', 'rover', 'daewoo',
       'dacia', 'land_rover', 'lancia', 'jaguar', 'porsche', 'lada',
       'trabant'],
      dtype='object')

I would like to use the 10 brands with the most listings.

In [67]:
mean_brand_price = {}

loop_array = autos["brand"].value_counts().index[0:10]

for b in loop_array:
    subset = autos[autos["brand"] == b]
    price_mean = subset["price"].mean()
    mean_brand_price[b] = price_mean
    
mean_brand_price

{'audi': 3736.0413313825898,
 'bmw': 3902.1266314912523,
 'fiat': 1654.2670588235294,
 'ford': 2033.9971358428804,
 'mercedes_benz': 3705.546218487395,
 'opel': 1910.7251005441212,
 'peugeot': 2011.728782287823,
 'renault': 1572.1974904528097,
 'seat': 2449.098901098901,
 'volkswagen': 2792.4991581401373}

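Among the ten most-listed brands, the three most expensive on average are BMW (about 3,900), Audi (about 3,740) and Mercedes-Benz (about 3,710). The next tier is Volkswagen (about 2,790), Seat (about 2,450), Ford (about 2,030) and Peugeot (about 2,010), and the cheapest are Opel (about 1,910), Fiat (about 1,650) and Renault (about 1,570). These means reflect the filtered data only: listings priced up to 10,000 with at least 125,000 km on the odometer.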
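The per-brand loop above can also be written with `groupby`, which returns a Series that is easy to sort. This is a minimal sketch on invented data (six rows, top 2 brands instead of top 10), so the numbers are not those of the real dataset.

```python
import pandas as pd

# Toy data: opel has the most listings, then bmw, then fiat.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "opel", "fiat"],
    "price": [4000.0, 3000.0, 2000.0, 1500.0, 2500.0, 1600.0],
})

# The most-listed brands (top 2 here; the notebook uses the top 10).
top_brands = sample["brand"].value_counts().index[:2]

# Mean price per brand, restricted to the top brands, most expensive first.
mean_price = (
    sample[sample["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```

Sorting the result makes the price tiers visible at a glance, which is harder to read off an unordered dictionary.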