## Exploring eBay Car Sales Data

In this project, I'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The aim of this project is to clean the data and analyze the included used car listings.

The data dictionary provided with the data is as follows:

dateCrawled - When this ad was first crawled. All field-values are taken from this date.

name - Name of the car.

seller - Whether the seller is private or a dealer.

offerType - The type of listing.

price - The price on the ad to sell the car.

abtest - Whether the listing is included in an A/B test.

vehicleType - The vehicle type.

yearOfRegistration - The year in which the car was first registered.

gearbox - The transmission type.

powerPS - The power of the car in PS.

model - The car model name.

kilometer - How many kilometers the car has driven.

monthOfRegistration - The month in which the car was first registered.

fuelType - What type of fuel the car uses.

brand - The brand of the car.

notRepairedDamage - Whether the car has damage that has not yet been repaired.

dateCreated - The date on which the eBay listing was created.

nrOfPictures - The number of pictures in the ad.

postalCode - The postal code for the location of the vehicle.

lastSeenOnline - When the crawler last saw this ad online.

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv("autos.csv",encoding="Latin-1")

In [2]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


The dataset contains 20 columns, most of which are stored as strings. There are a few columns with null values, but no columns have more than ~20% null values. There are some columns that contain dates stored as strings.
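The share of missing values per column can be checked directly rather than read off the `info()` output. A minimal sketch, using a small made-up frame (`toy`) in place of `autos`; the values are illustrative only:

```python
import pandas as pd

# Toy frame standing in for `autos`; the values are made up for illustration.
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "suv", "coupe"],
    "gearbox": ["manuell", "automatik", "manuell", "manuell", "manuell"],
    "powerPS": [158, 286, 102, 71, 0],
})

# isnull() gives a boolean frame; the column-wise mean is the fraction of nulls.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Applied to the real data, the same two calls would list the columns with the most missing values first.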

I'll start by cleaning the column names to make the data easier to work with.

## Cleaning column names

In [3]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [4]:
mapping_dict = {
    'name':'name',
    'seller':'seller',
    'price':'price',
    'abtest':'abtest',
    'gearbox':'gearbox',
    'model':'model',
    'odometer':'odometer_km',
    'brand':'brand',
    'yearOfRegistration': 'registration_year',
    'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'dateCrawled': 'ad_crawled',
    'offerType': 'offer_type',
    'vehicleType': 'vehicle_type',
    'powerPS': 'power_ps',
    'fuelType': 'fuel_type',
    'nrOfPictures': 'nr_of_pictures',
    'postalCode': 'postal_code',
    'lastSeen': 'ad_last_seen',
}

autos.rename(columns=mapping_dict,inplace=True)
print(autos.columns)

Index(['ad_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'ad_last_seen'],
      dtype='object')


In [5]:
autos.head()

Unnamed: 0,ad_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,ad_last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


I've changed all the camelCase names to snake_case (preferred in Python) to make the columns more readable.
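For purely mechanical renames, the camelCase to snake_case step could also be automated. The helper below, `camel_to_snake`, is a hypothetical function written for illustration and is not part of the notebook. It cannot produce the semantic renames chosen above (for example `dateCrawled` to `ad_crawled`), so it is a complement to the mapping dictionary, not a replacement:

```python
import re

def camel_to_snake(name):
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

examples = ["yearOfRegistration", "powerPS", "nrOfPictures", "abtest"]
converted = {name: camel_to_snake(name) for name in examples}
print(converted)
```

Because the lookbehind only matches after a lowercase letter or digit, a run of capitals such as `PS` stays together.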

## Initial exploration and cleaning

In [6]:
autos.describe(include="all")

Unnamed: 0,ad_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,ad_last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Our initial observations:

- There are some text columns where all (or nearly all) of the values are the same (seller, offer_type).
- The nr_of_pictures column looks odd: all of its summary statistics are 0, so we'll need to investigate this further.

In [7]:
autos["nr_of_pictures"].value_counts()

0    50000
Name: nr_of_pictures, dtype: int64

The column "nr_of_pictures" contains the same value for each entry.
Since seller and offer_type are also nearly constant, I'll drop the columns seller, offer_type and nr_of_pictures.

In [8]:
autos = autos.drop(["nr_of_pictures", "seller", "offer_type"], axis=1)

There are two columns, price and odometer_km, that hold numeric values stored as text with extra characters. I'll clean and convert these.

In [9]:
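autos["price"].value_counts()
# Strip the currency symbol and thousands separators as literal text, then cast to float
autos["price"] = autos["price"].str.replace('$','',regex=False)
autos["price"] = autos["price"].str.replace(',','',regex=False)
autos["price"] = autos["price"].astype(float)
print(autos["price"].head())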

0    5000.0
1    8500.0
2    8990.0
3    4350.0
4    1350.0
Name: price, dtype: float64


In [10]:
autos["odometer_km"].value_counts()
# Strip the unit and thousands separators as literal text, then cast to float
autos["odometer_km"] = autos["odometer_km"].str.replace('km','',regex=False)
autos["odometer_km"] = autos["odometer_km"].str.replace(',','',regex=False)
autos["odometer_km"] = autos["odometer_km"].astype(float)
print(autos["odometer_km"].head())

0    150000.0
1    150000.0
2     70000.0
3     70000.0
4    150000.0
Name: odometer_km, dtype: float64
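The two cleaning cells follow the same pattern: remove a few literal substrings, then cast to float. One way to factor this out is sketched below. `to_number` is a hypothetical helper, and the sample strings only mimic the format seen in the data:

```python
import pandas as pd

def to_number(series, remove):
    """Strip each literal substring in `remove` from a string Series, then cast to float."""
    cleaned = series
    for token in remove:
        # regex=False treats '$' as a plain character, not as an end-of-string anchor
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(float)

raw_price = pd.Series(["$5,000", "$8,500", "$0"])
raw_odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

price = to_number(raw_price, ["$", ","])
odometer_km = to_number(raw_odometer, ["km", ","])
print(price.tolist(), odometer_km.tolist())
```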


## Odometer and Price columns

In [11]:
print(autos["price"].describe())
print(autos["odometer_km"].describe())
print(autos["price"].value_counts().sort_index(ascending=True))

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
0.0           1421
1.0            156
2.0              3
3.0              1
5.0              2
8.0              1
9.0              1
10.0             7
11.0             2
12.0             3
13.0             2
14.0             1
15.0             2
17.0             3
18.0             1
20.0             4
25.0             5
29.0             1
30.0             7
35.0             1
40.0             6
45.0             4
47.0             1
49.0             4
50.0            49
55.0             2
59.0             1
60.0             9
65.0             5
66.0             1
        

In [12]:
# Replace prices outside the 100-100000 range with NaN (between() is inclusive on both ends)
autos.loc[~(autos["price"].between(100,100000)),"price"]=np.nan
print(autos["price"].value_counts().sort_index(ascending=True))

100.0      134
110.0        3
111.0        2
115.0        2
117.0        1
120.0       39
122.0        1
125.0        8
129.0        1
130.0       15
135.0        1
139.0        1
140.0        9
145.0        2
149.0        7
150.0      224
156.0        2
160.0        8
170.0        7
173.0        1
175.0       12
179.0        1
180.0       35
185.0        1
188.0        1
190.0       16
193.0        1
195.0        2
198.0        1
199.0       41
          ... 
73500.0      1
73900.0      1
73996.0      1
74900.0      3
74999.0      2
75000.0      1
75900.0      1
75997.0      1
76997.0      1
78911.0      1
79500.0      1
79933.0      1
79980.0      1
79999.0      1
80000.0      3
82987.0      1
83000.0      1
84000.0      1
84997.0      1
85000.0      1
86500.0      1
88900.0      1
89000.0      1
89900.0      1
93000.0      2
93911.0      1
94999.0      1
98500.0      1
99000.0      2
99900.0      2
Name: price, Length: 2273, dtype: int64


I replaced prices below \$100 and above \$100,000 with NaN because I judged them unrealistic. I did not remove any values from the odometer_km column.
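The masking step can also be written with `Series.where`, which keeps values where a condition holds and puts NaN elsewhere. A small sketch on made-up prices, assuming the same 100 to 100000 bounds:

```python
import pandas as pd

price = pd.Series([0.0, 1.0, 100.0, 4350.0, 100000.0, 1e8])

# between() includes both endpoints by default, so 100 and 100000 are kept.
price_clean = price.where(price.between(100, 100000))
n_masked = int(price_clean.isna().sum())
print(price_clean)
```

Because the rows stay in the frame, the other columns of a masked listing remain available; later means over price simply skip the NaN entries.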

## Date columns

There are a number of columns with date information:

- ad_crawled
- registration_month
- registration_year
- ad_created
- ad_last_seen

In [13]:
years_date_crawled=autos['ad_crawled'].str[:4]
years_date_crawled=years_date_crawled.astype(int)
print(years_date_crawled.value_counts())

2016    50000
Name: ad_crawled, dtype: int64


In [14]:
years_date_created=autos['ad_created'].str[:4]
years_date_created=years_date_created.astype(int)
print(years_date_created.value_counts())

2016    49994
2015        6
Name: ad_created, dtype: int64


In [15]:
years_last_seen=autos['ad_last_seen'].str[:4]
years_last_seen=years_last_seen.astype(int)
print(years_last_seen.value_counts())

2016    50000
Name: ad_last_seen, dtype: int64
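String slicing works because the timestamps have a fixed layout. An alternative is to parse them once with `pd.to_datetime` and use the `.dt` accessor. A minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS` layout seen above; the first two values come from the preview and the third is made up:

```python
import pandas as pd

ad_created = pd.Series([
    "2016-03-26 00:00:00",
    "2016-04-04 00:00:00",
    "2015-12-30 00:00:00",  # made-up value for illustration
])

# Parse once, then read year and month from the datetime accessor.
created = pd.to_datetime(ad_created, format="%Y-%m-%d %H:%M:%S")
year_counts = created.dt.year.value_counts().sort_index()
month_counts = created.dt.month.value_counts().sort_index()
print(year_counts)
```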


In [16]:
autos["registration_year"].describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The "registration_year" column, which indicates the year of registration of the car, has some impossible values such as 1000 or 9999, so it needs cleaning.

In [17]:
print(autos["registration_year"].value_counts().sort_index())

1000       1
1001       1
1111       1
1500       1
1800       2
1910       9
1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       2
1952       1
1953       1
1954       2
1955       2
1956       5
1957       2
1958       4
1959       7
1960      34
1961       6
1962       4
1963       9
        ... 
2001    2703
2002    2533
2003    2727
2004    2737
2005    3015
2006    2708
2007    2304
2008    2231
2009    2098
2010    1597
2011    1634
2012    1323
2013     806
2014     666
2015     399
2016    1316
2017    1453
2018     492
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, Length: 97, dtype: int64


In [18]:
month_date_crawled=autos['ad_crawled'].str[5:7]
month_date_crawled=month_date_crawled.astype(int)
print(month_date_crawled.value_counts())

3    41895
4     8105
Name: ad_crawled, dtype: int64


In [19]:
day_date_crawled=autos['ad_crawled'].str[8:10]
day_date_crawled=day_date_crawled.astype(int)
print(day_date_crawled.value_counts())

3     1934
5     1924
20    1891
21    1876
7     1869
12    1839
14    1831
4     1826
2     1770
19    1745
28    1742
29    1709
15    1699
1     1690
30    1681
8     1665
9     1661
22    1647
26    1624
11    1624
23    1619
10    1606
31    1596
25    1587
17    1576
27    1552
16    1475
24    1455
6      856
13     778
18     653
Name: ad_crawled, dtype: int64


In [20]:
month_date_created=autos['ad_created'].str[5:7]
month_date_created=month_date_created.astype(int)
print(month_date_created.value_counts())
day_date_created=autos['ad_created'].str[8:10]
day_date_created=day_date_created.astype(int)
print(day_date_created.value_counts())
month_ad_last_seen=autos['ad_last_seen'].str[5:7]
month_ad_last_seen=month_ad_last_seen.astype(int)
print(month_ad_last_seen.value_counts())
day_ad_last_seen=autos['ad_last_seen'].str[8:10]
day_ad_last_seen=day_ad_last_seen.astype(int)
print(day_ad_last_seen.value_counts())

3     41866
4      8053
2        63
1        12
12        2
11        1
9         1
8         1
6         1
Name: ad_created, dtype: int64
3     1990
4     1916
20    1895
21    1889
12    1834
7     1803
14    1764
2     1761
28    1758
5     1747
29    1716
1     1696
19    1695
15    1687
30    1673
8     1668
9     1665
22    1642
11    1641
26    1630
23    1613
25    1597
10    1597
31    1596
17    1561
27    1554
16    1502
24    1456
6      919
13     847
18     688
Name: ad_created, dtype: int64
4    28709
3    21291
Name: ad_last_seen, dtype: int64
6     11271
7      6814
5      6268
17     1396
3      1268
2      1245
30     1242
4      1231
31     1192
12     1191
1      1155
29     1117
22     1079
28     1043
21     1037
20     1035
24      978
25      960
23      929
26      848
16      822
27      801
15      794
19      787
14      640
11      626
10      538
9       493
13      449
8       380
18      371
Name: ad_last_seen, dtype: int64


In [21]:
print(autos["registration_month"].value_counts().sort_index())

0     5075
1     3282
2     3008
3     5071
4     4102
5     4107
6     4368
7     3949
8     3191
9     3389
10    3651
11    3360
12    3447
Name: registration_month, dtype: int64


In [22]:
autos_wo=autos[autos["registration_year"].between(1950,2016)]
print(autos_wo)

                ad_crawled                                               name  \
0      2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1      2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2      2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3      2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4      2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   
5      2016-03-21 13:47:45  Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...   
6      2016-03-20 17:55:21  VW_Golf_III_GT_Special_Electronic_Green_Metall...   
7      2016-03-16 18:55:19                               Golf_IV_1.9_TDI_90PS   
8      2016-03-22 16:51:34                                         Seat_Arosa   
9      2016-03-16 13:47:02          Renault_Megane_Scenic_1.6e_RT_Klimaanlage   
11     2016-03-16 18:45:34                         Mercedes_A140_Motorschaden   
12     2016-03-31 19:48:22  

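I decided to keep only the rows with a registration year between 1950 and 2016. It seems hard to believe that someone is selling cars registered before 1950, and a car cannot have been first registered after 2016, the year in which all the ads were crawled.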
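Before dropping rows, it can help to check how much of the data a filter removes. A sketch on made-up years, assuming the same 1950 to 2016 bounds:

```python
import pandas as pd

years = pd.Series([1000, 1910, 1950, 1999, 2005, 2016, 2017, 9999])

valid = years.between(1950, 2016)   # inclusive on both ends
kept = years[valid]
share_removed = 1 - valid.mean()    # mean of a boolean Series is the share of True
print(kept.tolist(), share_removed)
```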

In [23]:
autos_wo["registration_year"].value_counts(normalize=True)

2000    0.069869
2005    0.062807
1999    0.062495
2004    0.057016
2003    0.056808
2006    0.056412
2001    0.056308
2002    0.052766
1998    0.051100
2007    0.047996
2008    0.046475
2009    0.043705
1997    0.042246
2011    0.034039
2010    0.033268
1996    0.030081
2012    0.027560
2016    0.027414
1995    0.027352
2013    0.016790
2014    0.013874
1994    0.013749
1993    0.009270
2015    0.008312
1990    0.008228
1992    0.008145
1991    0.007416
1989    0.003771
1988    0.002958
1985    0.002187
          ...   
1982    0.000896
1972    0.000729
1979    0.000729
1960    0.000708
1981    0.000646
1967    0.000562
1976    0.000562
1971    0.000562
1968    0.000542
1973    0.000542
1974    0.000500
1966    0.000458
1977    0.000458
1969    0.000396
1975    0.000396
1965    0.000354
1964    0.000250
1963    0.000187
1959    0.000146
1961    0.000125
1956    0.000104
1962    0.000083
1958    0.000083
1950    0.000062
1955    0.000042
1957    0.000042
1954    0.000042
1951    0.0000

The distribution peaks around the years 2000-2005.

## Exploring Price by Brand

In [24]:
import operator
mean_price={}
brands = autos_wo["brand"].unique()
for b in brands:
    mean_price_brand = autos_wo[autos_wo["brand"] == b]
    mean_value=mean_price_brand["price"].mean(skipna=True)
    mean_price[b]=mean_value
sorted_means=sorted(mean_price.items(),key=operator.itemgetter(1))
print(sorted_means)

[('daewoo', 1064.0579710144928), ('rover', 1602.2903225806451), ('daihatsu', 1649.655172413793), ('trabant', 1846.5238095238096), ('renault', 2496.940394314535), ('lada', 2688.296296296296), ('fiat', 2836.8736310025274), ('opel', 3005.9309720265646), ('peugeot', 3113.860549132948), ('saab', 3211.6493506493507), ('mitsubishi', 3439.10290237467), ('lancia', 3444.877551020408), ('chrysler', 3486.5766871165642), ('smart', 3596.40273556231), ('ford', 3727.959789669038), ('citroen', 3796.26267281106), ('subaru', 4033.7551020408164), ('alfa_romeo', 4100.915857605178), ('honda', 4119.109589041096), ('suzuki', 4126.341818181818), ('mazda', 4129.774787535411), ('seat', 4433.419621749409), ('nissan', 4756.659634317863), ('volvo', 4993.208037825059), ('toyota', 5167.091062394604), ('hyundai', 5411.075431034483), ('volkswagen', 5433.228618085323), ('dacia', 5915.528455284553), ('kia', 6018.442073170731), ('skoda', 6409.609724047306), ('chevrolet', 6759.885931558935), ('bmw', 8249.015481089555), ('m

For now, I loop over every brand to get a better understanding of the price distribution.
The dataset includes every kind of car, from the cheapest brands (daewoo, rover, daihatsu) to the most expensive (jaguar, land rover, porsche).
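The explicit loop over brands has a more compact equivalent in `groupby`. A sketch on a made-up frame (`toy`); as with the loop, `mean()` skips NaN prices by default:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "brand": ["opel", "opel", "bmw", "bmw", "audi"],
    "price": [3000.0, np.nan, 8000.0, 9000.0, 9500.0],
})

# One mean per brand, sorted from cheapest to most expensive.
mean_by_brand = toy.groupby("brand")["price"].mean().sort_values()
print(mean_by_brand)
```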

In [25]:
mean_price_top_seller={}
counting_cars=autos_wo["brand"].value_counts()
top_brands=counting_cars.sort_values(ascending=False).head(6)
print(top_brands)
for b in top_brands.index:
    mean_price_brand = autos_wo[autos_wo["brand"] == b]
    mean_value=mean_price_brand["price"].mean(skipna=True)
    mean_price_top_seller[b]=mean_value
sorted_means=sorted(mean_price_top_seller.items(),key=operator.itemgetter(1))
print(sorted_means)

volkswagen       10187
bmw               5283
opel              5192
mercedes_benz     4578
audi              4149
ford              3349
Name: brand, dtype: int64
[('opel', 3005.9309720265646), ('ford', 3727.959789669038), ('volkswagen', 5433.228618085323), ('bmw', 8249.015481089555), ('mercedes_benz', 8570.232849162012), ('audi', 9339.529967669734)]


Among the top 6 brands, there is a distinct price gap:

- Audi, BMW and Mercedes Benz are more expensive;
- Ford and Opel are less expensive;
- Volkswagen is in between.

Volkswagen can be seen as a good trade-off, which may explain its popularity.

## Exploring Mileage by Brand

In [26]:
mean_mileage_top_seller={}
for b in top_brands.index:
    mean_mileage_brand = autos_wo[autos_wo["brand"] == b]
    mean_distance=mean_mileage_brand["odometer_km"].mean(skipna=True)
    mean_mileage_top_seller[b]=mean_distance
mileage_series=pd.Series(mean_mileage_top_seller)
price_series=pd.Series(mean_price_top_seller)
auto_aggreg=pd.DataFrame(price_series,columns=["mean_price"])
auto_aggreg["mean_distance"]=mileage_series
import pprint
pp = pprint.PrettyPrinter()
pp.pprint(auto_aggreg)

                mean_price  mean_distance
audi           9339.529968  129287.780188
bmw            8249.015481  132458.830210
ford           3727.959790  124151.985667
mercedes_benz  8570.232849  130903.232853
opel           3005.930972  129252.696456
volkswagen     5433.228618  128733.189359


Mean mileage varies far less by brand than mean price does: all the top brands fall within 10% of each other. There is a slight trend for the more expensive brands to have higher mileage and the less expensive brands to have lower mileage.
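Both statements can be checked numerically from the aggregated table. The sketch below re-enters the six rows printed above and computes the relative spread of mean mileage and the Pearson correlation between the two columns. With only six brands, the correlation is indicative at best:

```python
import pandas as pd

# Values copied from the aggregated table printed above.
auto_aggreg = pd.DataFrame(
    {
        "mean_price": [9339.529968, 8249.015481, 3727.959790,
                       8570.232849, 3005.930972, 5433.228618],
        "mean_distance": [129287.780188, 132458.830210, 124151.985667,
                          130903.232853, 129252.696456, 128733.189359],
    },
    index=["audi", "bmw", "ford", "mercedes_benz", "opel", "volkswagen"],
)

dist = auto_aggreg["mean_distance"]
relative_spread = (dist.max() - dist.min()) / dist.max()
corr = auto_aggreg["mean_price"].corr(dist)  # Pearson correlation by default
print(relative_spread, corr)
```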

## Further cleaning and analysis

In [27]:
# Build YYYYMMDD integers from the date part of each timestamp string
date_crawled=autos_wo['ad_crawled'].str[:4]+autos_wo['ad_crawled'].str[5:7]+autos_wo['ad_crawled'].str[8:10]
date_crawled=date_crawled.astype(int)
date_created=autos_wo['ad_created'].str[:4]+autos_wo['ad_created'].str[5:7]+autos_wo['ad_created'].str[8:10]
date_created=date_created.astype(int)
date_last_seen=autos_wo['ad_last_seen'].str[:4]+autos_wo['ad_last_seen'].str[5:7]+autos_wo['ad_last_seen'].str[8:10]
date_last_seen=date_last_seen.astype(int)
autos_def=autos_wo.copy()
autos_def['ad_crawled']=date_crawled
autos_def['ad_created']=date_created
autos_def['ad_last_seen']=date_last_seen

In [28]:
for c in autos_def.columns:
    if autos_def[c].dtype=="object":
        print(autos_def[c].unique())

['Peugeot_807_160_NAVTECH_ON_BOARD'
 'BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik' 'Volkswagen_Golf_1.6_United'
 ... 'Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon'
 'Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+Reifen_neu_!!'
 'Fiat_500_C_1.2_Dualogic_Lounge']
['control' 'test']
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio'
 'andere']
['manuell' 'automatik' nan]
['andere' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan
 'a3' 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo'
 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo'
 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera'
 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris'
 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito'
 'cordoba' 'galaxy' '100' 'sharan' 'octavia' 'm_klasse' 'lupo' 'fiesta'
 'superb' 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg'
 'seicento' 'avensis' 'vivaro' 'x_reihe'

I'll translate some values from German to English.

In [29]:
traduction={
    "manuell":"manual",
    "automatik":"automatic",
    "nein":"no",
    "ja":"yes"
}
autos_def["gearbox"]=autos_def["gearbox"].map(traduction)
autos_def["unrepaired_damage"]=autos_def["unrepaired_damage"].map(traduction)
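One detail of `Series.map` with a dictionary is that any value missing from the dictionary becomes NaN. `Series.replace` leaves such values untouched, which may be preferable if a column could contain words the dictionary does not cover. A sketch on made-up data; `halbautomatik` is a hypothetical value, not one observed in the dataset:

```python
import pandas as pd

translation = {"manuell": "manual", "automatik": "automatic"}
gearbox = pd.Series(["manuell", "automatik", None, "halbautomatik"])

mapped = gearbox.map(translation)        # unknown values turn into NaN
replaced = gearbox.replace(translation)  # unknown values are kept as they are
print(mapped.tolist(), replaced.tolist())
```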

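The cell below is meant to list the most common model for each of the most popular brands. Note, however, that sorting by model in descending order and taking the first row returns the alphabetically last model name for each brand, not the most frequent one, so the output should be read accordingly.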

In [30]:
common_model_for_brands={}
for b in top_brands.index:
    common_model = autos_def[autos_def["brand"] == b].sort_values("model",ascending=False).iloc[0].loc["model"]
    common_model_for_brands[b]=common_model
print(common_model_for_brands)

{'audi': 'tt', 'bmw': 'z_reihe', 'mercedes_benz': 'vito', 'ford': 'transit', 'volkswagen': 'up', 'opel': 'zafira'}
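To count frequencies instead of sorting names, one option is `value_counts()` per brand. A minimal sketch on made-up data (`toy`), contrasting the most frequent model with the alphabetical maximum:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["volkswagen"] * 4 + ["bmw"] * 3,
    "model": ["golf", "golf", "polo", "up", "3er", "3er", "z_reihe"],
})

# Most frequent model per brand: count values within each group, take the top label.
most_common = toy.groupby("brand")["model"].agg(lambda s: s.value_counts().idxmax())

# Alphabetically last model per brand, which is what a descending sort returns.
alphabetically_last = toy.groupby("brand")["model"].max()
print(most_common.to_dict(), alphabetically_last.to_dict())
```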


In [31]:
print(autos_def["odometer_km"].unique())

[150000.  70000.  50000.  80000.  10000.  30000. 125000.  90000.  20000.
  60000.   5000.  40000. 100000.]


I'll now analyze the price as a function of the mileage.

In [32]:
mean_price_for_milage={}
group_milage=autos_def["odometer_km"].unique()
for g in group_milage:
    data_milage = autos_def[autos_def["odometer_km"] == g]
    mean_price=data_milage["price"].mean(skipna=True)
    mean_price_for_milage[g]=mean_price
milage_price_series=pd.Series(mean_price_for_milage)
print(milage_price_series)

5000.0       7101.380403
10000.0     19569.886463
20000.0     16704.047880
30000.0     15734.814913
40000.0     15331.973552
50000.0     13556.101317
60000.0     12113.074866
70000.0     10954.826014
80000.0      9673.630197
90000.0      8473.360862
100000.0     7970.518863
125000.0     6189.368280
150000.0     3778.533110
dtype: float64


Excluding the lowest mileage group (5,000 km), which probably includes damaged cars, the price predictably decreases as the mileage increases.

In [33]:
mean_price_for_damage={}
group_damage=autos_def["unrepaired_damage"].unique()
for g in group_damage:
    data_damage = autos_def[autos_def["unrepaired_damage"] == g]
    mean_price=data_damage["price"].mean(skipna=True)
    mean_price_for_damage[g]=mean_price
damage_price_series=pd.Series(mean_price_for_damage)
print(damage_price_series)

NaN            NaN
yes    2250.537757
no     6995.619141
dtype: float64


On average, cars with unrepaired damage cost less than one third as much as undamaged cars.
This result may be biased, though, since a luxury car is more likely to be repaired before the sale.
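One way to reduce this bias would be to compare damaged and undamaged cars within the same brand rather than across the whole dataset. The sketch below shows the idea on made-up data; the column names match the cleaned frame, but the numbers are illustrative and say nothing about the real result:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["opel", "opel", "opel", "bmw", "bmw", "bmw"],
    "unrepaired_damage": ["yes", "no", "no", "yes", "no", "no"],
    "price": [1000.0, 3000.0, 3400.0, 3000.0, 8000.0, 9000.0],
})

# Mean price per (brand, damage) pair, with damage status spread into columns.
by_brand = toy.groupby(["brand", "unrepaired_damage"])["price"].mean().unstack()
by_brand["ratio"] = by_brand["yes"] / by_brand["no"]
print(by_brand)
```

If the damaged-to-undamaged ratio stays similar across brands, the overall figure is less likely to be driven by the mix of brands alone.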