# Exploring eBay Car Sales Data #

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle. We sampled 50,000 data points from the full dataset.

*The aim of this project is to clean the data and analyze the included used car listings.*


In [1]:
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv",encoding="Latin-1")

In [2]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


We can make the following observations:

* The dataset contains 20 columns, most of which are strings.
* Some columns have null values, but none have more than ~20% null values.
* The column names use camelCase instead of Python's preferred snake_case, which means we can't just replace spaces with underscores.
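The share of null values per column, which the second observation relies on, can be checked directly. The sketch below uses a small made-up frame (the rows and values are illustrative assumptions, not taken from `autos.csv`); the same expression should work on `autos`.

```python
import pandas as pd

# Hypothetical sample standing in for the real autos DataFrame.
sample = pd.DataFrame({
    "brand": ["bmw", "opel", "audi", "ford", "seat"],
    "gearbox": ["manuell", None, "automatik", "manuell", "manuell"],
    "unrepaired_damage": ["nein", None, None, "ja", "nein"],
})

# isnull() gives a boolean frame; the mean of booleans is the null share.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Sorting in descending order puts the columns with the most missing data first.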

## 1. Cleaning Data

In [3]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Let's convert the column names from camelCase to snake_case and reword some of them based on the data dictionary to be more descriptive.
The reformatted table below should be much easier to read.

In [4]:
# Map each original camelCase name to its snake_case (and, where useful, reworded) name.
autos.rename(
    {
        "yearOfRegistration": "registration_year",
        "monthOfRegistration": "registration_month",
        "notRepairedDamage": "unrepaired_damage",
        "dateCreated": "ad_created",
        "dateCrawled": "date_crawled",
        "offerType": "offer_type",
        "vehicleType": "vehicle_type",
        "fuelType": "fuel_type",
        "nrOfPictures": "nr_of_pictures",
        "postalCode": "postal_code",
        "lastSeen": "last_seen",
        "powerPS": "power_ps",
    },
    axis=1,
    inplace=True,
)

autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
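The purely mechanical part of this conversion (camelCase to snake_case, without the rewording) could also be done programmatically. The helper below is a minimal sketch; `camel_to_snake` is a hypothetical name introduced here, and reworded names such as `registration_year` would still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case."""
    # Insert "_" wherever a lowercase letter or digit is followed by an
    # uppercase letter, e.g. "dateCrawled" -> "date_Crawled".
    with_underscores = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return with_underscores.lower()

for original in ["dateCrawled", "nrOfPictures", "powerPS", "abtest"]:
    print(original, "->", camel_to_snake(original))
```

On a DataFrame, the helper could be applied with `autos.columns = [camel_to_snake(c) for c in autos.columns]`.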

In [5]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for: 
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. 
- Examples of numeric data stored as text which can be cleaned and converted.
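One way to spot the first kind of column is to measure how much of each column is taken up by its most frequent value. The following is a sketch on invented data; the 0.85 threshold and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample: "seller" is almost constant, "brand" is not.
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["other"],
    "brand": ["bmw", "opel", "audi", "ford", "seat"] * 2,
})

# For each column, the share of rows held by its most common value.
top_share = sample.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(top_share)

# Columns dominated by a single value are candidates for dropping.
near_constant = top_share[top_share > 0.85].index.tolist()
print(near_constant)
```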

In [6]:
autos.describe(include="all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-29 23:42:13,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Let's look at the table and note a few things:
Many of the values are `NaN`. This is expected: numeric statistics such as mean, std and the percentiles are not computed for text columns, and `unique`, `top` and `freq` are not computed for numeric columns.
The date columns are stored as text, and `price` and `odometer` hold numeric values stored as text, so they need to be cleaned.
`seller` and `offer_type` each have only two unique values, and in both columns a single value accounts for 49,999 of the 50,000 rows, so they carry almost no information for analysis.


Let's drop the `offer_type` column, since almost all of its values are identical:

In [7]:
autos = autos.drop('offer_type',axis=1)

Let's clean the `price` and `odometer` columns, which hold numeric values stored as text.
For each column, we will:
* Remove any non-numeric characters
* Convert the column to a numeric dtype


In [8]:
# regex=False treats "$" as a literal character rather than a regex anchor.
autos["price"] = (
    autos["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

autos["odometer"] = (
    autos["odometer"]
    .str.replace("km", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
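The same two steps can be folded into one reusable helper that strips every non-digit character with a regular expression. This is a sketch only: `to_int` is a hypothetical name, and it assumes the column has no missing values, negative numbers or decimals (all of which this approach would mishandle).

```python
import pandas as pd

def to_int(series):
    """Remove all non-digit characters from a text Series and cast to int."""
    # \D matches any character that is not a digit.
    return series.str.replace(r"\D", "", regex=True).astype(int)

# Sample values in the same format as the price and odometer columns.
sample_price = pd.Series(["$5,000", "$8,500", "$0"])
sample_odometer = pd.Series(["150,000km", "70,000km"])

print(to_int(sample_price).tolist())     # [5000, 8500, 0]
print(to_int(sample_odometer).tolist())  # [150000, 70000]
```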


In [9]:
autos["price"].unique()

array([ 5000,  8500,  8990, ...,   385, 22200, 16995])

In [10]:
autos["odometer"].unique()

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])

The fact that the `odometer` column is in km is important to remember, so let's rename the column to make the unit explicit:

In [11]:
autos.rename({"odometer":"odometer_km"}, axis=1, inplace=True)

autos.head()

Unnamed: 0,date_crawled,name,seller,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Let's continue exploring the data, looking specifically for values that don't look right. We'll start with the `odometer_km` and `price` columns. Here are the steps we'll take:

Analyze each column's minimum and maximum values and look for any that are unrealistically high or low (outliers) that we might want to remove.
We'll use:

* `Series.unique().shape` to see how many unique values there are.
* `Series.describe()` to view min/max/median/mean etc.
* `Series.value_counts()`, with some variations:
    * chained to `.head()` if there are lots of values.
    * Because `Series.value_counts()` returns a Series, we can use `Series.sort_index()` with `ascending=True` or `False` to view the highest and lowest values with their counts (this can also be chained to `.head()`).
* When removing outliers, we can do `df[(df["col"] >= x) & (df["col"] <= y)]`, but it's more readable to use `df[df["col"].between(x, y)]`, which includes both endpoints by default.
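The tools listed above can be tried on a small Series first. The values below are invented for illustration; the point of the sketch is that `between` (inclusive by default) selects the same rows as the explicit `>=` / `<=` comparison.

```python
import pandas as pd

# Hypothetical prices, including a zero and an implausibly high value.
prices = pd.Series([0, 1, 500, 500, 3000, 350000, 999990])

# Counts per value, highest values first.
counts = prices.value_counts().sort_index(ascending=False)
print(counts.head(3))

# Two equivalent ways of building the same inclusive filter.
mask_explicit = (prices >= 1) & (prices <= 350000)
mask_between = prices.between(1, 350000)
print(mask_explicit.equals(mask_between))  # True

print(prices[mask_between].tolist())  # [1, 500, 500, 3000, 350000]
```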

In [12]:
autos["odometer_km"].unique().shape

(13,)

In [13]:
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [14]:
autos["odometer_km"].value_counts().sort_index(ascending=True)

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

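From the counts above, most cars in our dataset are high mileage: 32,424 of the 50,000 listings (about 65%) are in the 150,000 km bucket. The values also fall into just 13 round-number buckets, which suggests that sellers chose from preset options rather than entering an exact reading.

Let's review the `price` column.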

In [15]:
autos["price"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [16]:
autos["price"].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

We can see that 1,421 listings (about 2.8% of the dataset) have a price of 0.

In [17]:
autos["price"].value_counts().sort_index(ascending=True).tail(20)

197000      1
198000      1
220000      1
250000      1
259000      1
265000      1
295000      1
299000      1
345000      1
350000      1
999990      1
999999      2
1234566     1
1300000     1
3890000     1
10000000    1
11111111    2
12345678    3
27322222    1
99999999    1
Name: price, dtype: int64

Prices rise gradually up to 350,000 and then jump to 999,990 and beyond, which looks unrealistic, so let's keep only the listings with a price between 1 and 350,000 (inclusive).

In [18]:
autos = autos[autos["price"].between(1,350000)]

In [19]:
autos["price"].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

Let's now move on to the date columns and understand the date range the data covers.

There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:

- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

Right now, the `date_crawled`, `last_seen`, and `ad_created` columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:



In [20]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


You'll notice that the first 10 characters represent the day (e.g. `2016-03-12`), so we can extract just the date with `Series.str[:10]`:

In [21]:
print(autos['date_crawled'].str[:10])

0        2016-03-26
1        2016-04-04
2        2016-03-26
3        2016-03-12
4        2016-04-01
            ...    
49995    2016-03-27
49996    2016-03-28
49997    2016-04-02
49998    2016-03-08
49999    2016-03-14
Name: date_crawled, Length: 48565, dtype: object


In [22]:
autos["date_crawled"].value_counts(normalize=True, dropna=False).sort_index(ascending=False)

2016-04-07 14:36:56    0.000021
2016-04-07 14:36:55    0.000021
2016-04-07 14:36:44    0.000021
2016-04-07 14:30:26    0.000021
2016-04-07 14:30:09    0.000021
                         ...   
2016-03-05 14:07:21    0.000021
2016-03-05 14:07:08    0.000021
2016-03-05 14:07:04    0.000021
2016-03-05 14:06:40    0.000021
2016-03-05 14:06:30    0.000021
Name: date_crawled, Length: 46882, dtype: float64

We can see that the data was crawled over roughly one month, from 2016-03-05 to 2016-04-07.
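Because almost every full timestamp is unique, counting them directly gives a long list of near-identical tiny shares. Slicing the first 10 characters before counting groups the listings by day instead. The sketch below uses the five `date_crawled` values shown in the sample rows earlier; on the real data the same chain would be applied to `autos["date_crawled"]`.

```python
import pandas as pd

# The five date_crawled values from the sample rows shown earlier.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
    "2016-04-01 14:38:50",
])

# Keep only the date part, then compute the share of listings per day.
daily_share = (
    crawled.str[:10]
    .value_counts(normalize=True, dropna=False)
    .sort_index()
)
print(daily_share)
```

Sorting by index puts the days in chronological order, because ISO-formatted date strings sort the same way as the dates they represent.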

In [23]:
autos["ad_created"].value_counts(normalize=True, dropna=False).sort_index(ascending=False)

2016-04-07 00:00:00    0.001256
2016-04-06 00:00:00    0.003253
2016-04-05 00:00:00    0.011819
2016-04-04 00:00:00    0.036858
2016-04-03 00:00:00    0.038855
                         ...   
2015-12-05 00:00:00    0.000021
2015-11-10 00:00:00    0.000021
2015-09-09 00:00:00    0.000021
2015-08-10 00:00:00    0.000021
2015-06-11 00:00:00    0.000021
Name: ad_created, Length: 76, dtype: float64

In [24]:
autos["last_seen"].value_counts(normalize=True, dropna=False).sort_index(ascending=False)

2016-04-07 14:58:50    0.000062
2016-04-07 14:58:48    0.000062
2016-04-07 14:58:46    0.000021
2016-04-07 14:58:45    0.000021
2016-04-07 14:58:44    0.000062
                         ...   
2016-03-05 15:16:47    0.000021
2016-03-05 15:16:11    0.000021
2016-03-05 14:49:34    0.000021
2016-03-05 14:46:02    0.000021
2016-03-05 14:45:46    0.000021
Name: last_seen, Length: 38474, dtype: float64
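An alternative to working with the timestamps as strings is to parse them into real datetime values, which makes date arithmetic possible. This is a sketch using two timestamps from the first sample row; the explicit `format` argument is an assumption based on the layout seen in the data.

```python
import pandas as pd

# date_crawled and last_seen for the first sample row shown earlier.
stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-06 06:45:54"])

# Parse the text into datetime64 values.
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")
print(parsed.dt.date.tolist())

# Whole days between the crawl and the last time the listing was seen.
days_online = (parsed.iloc[1] - parsed.iloc[0]).days
print(days_online)  # 10
```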

Now let's move on to the `registration_year` column:

In [25]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

There are clearly outliers in this data. The minimum is 1000, long before cars were invented, and the maximum is 9999, far in the future. Let's take a more detailed look at how many outliers we have:

In [26]:
autos["registration_year"].value_counts().sort_index(ascending=True).head(20)

1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
1950    3
1951    2
1952    1
1953    1
1954    2
Name: registration_year, dtype: int64

In [27]:
autos["registration_year"].value_counts().sort_index(ascending=False).head(40)

9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
2009    2085
2008    2215
2007    2277
2006    2670
2005    2936
2004    2703
2003    2699
2002    2486
2001    2636
2000    3156
1999    2897
1998    2363
1997    1951
1996    1373
1995    1227
1994     629
1993     425
1992     370
1991     339
1990     347
Name: registration_year, dtype: int64

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

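In the counts above, the earliest plausible year is 1910; the values below it (1000, 1001, 1111 and 1800) are clearly errors. Together with the years above 2016, the rows outside the **1910 - 2016** interval add up to 1,884 of 48,565 listings (about 3.9%), so it's safe to remove them entirely rather than write more custom logic.

Let's keep only the listings with a registration year between 1910 and 2016.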
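Counting the rows that fall outside an interval before dropping them can be sketched as follows. The years in the sample are invented for illustration; on the real data the mask would be built from `autos["registration_year"]`.

```python
import pandas as pd

# Hypothetical registration years, some clearly invalid.
years = pd.Series([1000, 1800, 1910, 1999, 2004, 2016, 2017, 9999])

# True for years inside the inclusive 1910-2016 interval.
in_range = years.between(1910, 2016)

# Number and share of rows that would be removed.
n_removed = int((~in_range).sum())
share_removed = (~in_range).mean()
print(n_removed, share_removed)  # 4 0.5
```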

In [28]:
autos = autos[autos["registration_year"].between(1910,2016)]

autos["registration_year"].value_counts(normalize=True).head(50).sort_index(ascending=False)

2016    0.026135
2015    0.008397
2014    0.014203
2013    0.017202
2012    0.028063
2011    0.034768
2010    0.034040
2009    0.044665
2008    0.047450
2007    0.048778
2006    0.057197
2005    0.062895
2004    0.057904
2003    0.057818
2002    0.053255
2001    0.056468
2000    0.067608
1999    0.062060
1998    0.050620
1997    0.041794
1996    0.029412
1995    0.026285
1994    0.013474
1993    0.009104
1992    0.007926
1991    0.007262
1990    0.007433
1989    0.003727
1988    0.002892
1987    0.001542
1986    0.001542
1985    0.002035
1984    0.001093
1983    0.001093
1982    0.000878
1981    0.000600
1980    0.001821
1979    0.000728
1978    0.000943
1977    0.000471
1976    0.000450
1974    0.000514
1973    0.000493
1972    0.000707
1971    0.000557
1970    0.000814
1968    0.000557
1967    0.000557
1966    0.000471
1960    0.000493
Name: registration_year, dtype: float64

This looks much better: most cars in the dataset were first registered between the mid-1990s and the late 2000s, with a peak in 2000.

Now let's focus on the `brand` column to see which makes are represented:

In [29]:
autos["brand"].unique()

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [30]:
autos["brand"].value_counts(normalize=True)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

Volkswagen is clearly the most common brand, with about 21% of listings. Most brands (23 of 40) each account for less than 1% of listings, so let's limit the analysis to brands with more than 1% of listings. That means we stop after Hyundai.

In [31]:
brands = autos["brand"].value_counts(normalize = True)
important_brands = brands[brands > 0.010].index
brand_mean_prices = {}

for row in important_brands:
    brand = autos[autos["brand"]==row]
    mean_price = brand["price"].mean()
    brand_mean_prices[row] = int(mean_price)
    
print(brand_mean_prices)    

{'volkswagen': 5402, 'bmw': 8332, 'opel': 2975, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749, 'renault': 2474, 'peugeot': 3094, 'fiat': 2813, 'seat': 4397, 'skoda': 6368, 'nissan': 4743, 'mazda': 4112, 'smart': 3580, 'citroen': 3779, 'toyota': 5167, 'hyundai': 5365}


Above, we aggregated across brands to understand mean price. Among the six most common brands, there's a distinct price gap:

* Audi, BMW and Mercedes-Benz are more expensive
* Ford and Opel are less expensive
* Volkswagen is in between

In [32]:
bmp_series = pd.Series(brand_mean_prices)
print(bmp_series)

volkswagen       5402
bmw              8332
opel             2975
mercedes_benz    8628
audi             9336
ford             3749
renault          2474
peugeot          3094
fiat             2813
seat             4397
skoda            6368
nissan           4743
mazda            4112
smart            3580
citroen          3779
toyota           5167
hyundai          5365
dtype: int64


In [33]:
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df

Unnamed: 0,mean_price
volkswagen,5402
bmw,8332
opel,2975
mercedes_benz,8628
audi,9336
ford,3749
renault,2474
peugeot,3094
fiat,2813
seat,4397


Now that we have the average prices for the top brands, let's calculate the mean mileage for each of them:

In [34]:
brand_mean_mileage = {}

for row in important_brands:
    brand = autos[autos["brand"]==row]
    mean_mileage = brand["odometer_km"].mean()
    brand_mean_mileage[row] = int(mean_mileage)
print(brand_mean_mileage)    


{'volkswagen': 128707, 'bmw': 132572, 'opel': 129310, 'mercedes_benz': 130788, 'audi': 129157, 'ford': 124266, 'renault': 128071, 'peugeot': 127153, 'fiat': 117121, 'seat': 121131, 'skoda': 110848, 'nissan': 118330, 'mazda': 124464, 'smart': 99326, 'citroen': 119694, 'toyota': 115944, 'hyundai': 106442}
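The two loops above compute one mean per brand by filtering the frame repeatedly. A `groupby` can produce both aggregates in a single pass. The sketch below runs on a small invented frame, and the 25% threshold is only there to suit the tiny sample (the analysis itself uses 1%).

```python
import pandas as pd

# Hypothetical listings; values are made up for illustration.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "audi"],
    "price": [8000, 9000, 3000, 2000, 9500],
    "odometer_km": [150000, 125000, 150000, 100000, 125000],
})

# Keep only brands above a share threshold.
shares = sample["brand"].value_counts(normalize=True)
common = shares[shares > 0.25].index

# Mean price and mean mileage per remaining brand.
summary = (
    sample[sample["brand"].isin(common)]
    .groupby("brand")[["price", "odometer_km"]]
    .mean()
    .rename(columns={"price": "mean_price", "odometer_km": "mean_mileage"})
)
print(summary)
```

Unlike the loops, this keeps the means as floats; casting with `.astype(int)` would truncate them the same way `int(mean_price)` does.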


In [35]:
bmm_series = pd.Series(brand_mean_mileage)
d = {'mean_price': bmp_series,'mean_mileage': bmm_series}
df = pd.DataFrame(data=d).sort_index(ascending=True)
df

Unnamed: 0,mean_price,mean_mileage
audi,9336,129157
bmw,8332,132572
citroen,3779,119694
fiat,2813,117121
ford,3749,124266
hyundai,5365,106442
mazda,4112,124464
mercedes_benz,8628,130788
nissan,4743,118330
opel,2975,129310


We can conclude that while Audi, BMW and Mercedes-Benz are by far the most expensive brands, they also record some of the highest mean mileages, so lower mileage does not explain their higher prices.
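To make the comparison easier to read, the brands can be sorted by mean price with a mileage rank alongside. The sketch below uses the figures reported above for the six most common brands; the `mileage_rank` column is an addition introduced here for illustration.

```python
import pandas as pd

# Mean price and mean mileage for the six most common brands,
# copied from the results above.
summary = pd.DataFrame(
    {
        "mean_price": [9336, 8628, 8332, 5402, 3749, 2975],
        "mean_mileage": [129157, 130788, 132572, 128707, 124266, 129310],
    },
    index=["audi", "mercedes_benz", "bmw", "volkswagen", "ford", "opel"],
)

# Most expensive brands first; rank 1 means the highest mean mileage.
ranked = summary.sort_values("mean_price", ascending=False)
ranked["mileage_rank"] = ranked["mean_mileage"].rank(ascending=False).astype(int)
print(ranked)
```

Among these six brands, BMW and Mercedes-Benz rank first and second for mileage, while Audi ranks fourth, just behind Opel.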