#### eBay Kleinanzeigen

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle.


---
The data dictionary provided with the data is as follows:

* `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
* `name` - Name of the car.
* `seller` - Whether the seller is private or a dealer.
* `offerType` - The type of listing.
* `price` - The price on the ad to sell the car.
* `abtest` - Whether the listing is included in an A/B test.
* `vehicleType` - The vehicle type.
* `yearOfRegistration` - The year in which the car was first registered.
* `gearbox` - The transmission type.
* `powerPS` - The power of the car in PS.
* `model` - The car model name.
* `kilometer` - How many kilometers the car has driven.
* `monthOfRegistration` - The month in which the car was first registered.
* `fuelType` - What type of fuel the car uses.
* `brand` - The brand of the car.
* `notRepairedDamage` - Whether the car has damage that has not yet been repaired.
* `dateCreated` - The date on which the eBay listing was created.
* `nrOfPictures` - The number of pictures in the ad.
* `postalCode` - The postal code for the location of the vehicle.
* `lastSeenOnline` - When the crawler saw this ad last online.

---
**The aim of this project is to clean the data and analyze the included used car listings.**

We start by importing the pandas and NumPy libraries and reading the file into a dataframe.

In [1]:
#Slide 1

import numpy as np
import pandas as pd

autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
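The `encoding` argument matters because the file contains characters such as `Ü` that are not valid UTF-8 bytes in this file. As a rough sketch of how one might find a working encoding, the helper below tries a list of candidates in turn. The helper name `read_csv_with_fallback` and the two-line sample are hypothetical and stand in for the real `autos.csv`.

```python
import io

import pandas as pd

# Hypothetical two-line CSV, encoded as Latin-1 so that "Ü" is a single non-UTF-8 byte.
raw_bytes = 'name,price\nFord_Focus_TÜV_neu,"$1,350"\n'.encode('latin-1')


def read_csv_with_fallback(data, encodings=('utf-8', 'latin-1')):
    """Return (dataframe, encoding) for the first encoding that decodes `data`."""
    for encoding in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot decode the bytes; try the next one.
            continue
    raise ValueError('none of the candidate encodings could decode the data')


sample, used_encoding = read_csv_with_fallback(raw_bytes)
print(used_encoding)
print(sample)
```

Latin-1 can decode any byte sequence, so it works best as the last candidate in the list.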

In [2]:
print(autos.info())
print('\n')
print(autos.head())
print('\n')
print(autos.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Our dataframe has 50,000 rows and 20 columns, holding a mixture of strings and numbers. The `name` column mixes several pieces of information, and most of the labels are in German; nevertheless, they are easy to understand.

Some columns have null values, but none have more than ~20% null values.
The column names use camelCase instead of Python's preferred snake_case, which means we can't just replace spaces with underscores; we have to split the names at their capital letters instead.
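A minimal sketch of one way to do that conversion with a regular expression is shown here. The function name `camel_to_snake` is hypothetical, and this variant keeps runs of capitals together (`powerPS` becomes `power_ps`), which differs slightly from the pattern used in the notebook cell.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase name to snake_case."""
    # Insert '_' wherever a lowercase letter or digit is followed by a capital,
    # so a run of capitals such as "PS" stays together.
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return with_underscores.lower()


for original in ['dateCrawled', 'nrOfPictures', 'powerPS', 'abtest']:
    print(original, '->', camel_to_snake(original))
```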

In [4]:
#Slide 2 

print(autos.columns)
print('\n')
import re

# Columns whose new names are chosen by hand rather than derived by the regex
renames = {
    'yearOfRegistration': 'registration_year',
    'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
}

new_col = []

for c in autos.columns:
    if c in renames:
        c = renames[c]
    else:
        # Insert '_' before every capital letter except a leading one, then lowercase
        c = re.sub(r'(?<!^)(?=[A-Z])', '_', c).lower()

    new_col.append(c)

autos.columns = new_col

autos.head()
##print(new_col)

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')


Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


We changed the column labels to snake_case, the naming style preferred in Python.

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
- Examples of numeric data stored as text, which can be cleaned and converted.
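As a sketch of the first check, the share of the most frequent value can be computed for every column; a share close to 1 flags a column where almost all values are the same. The helper name `dominant_share` and the toy dataframe are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the real listings.
toy = pd.DataFrame({
    'seller': ['privat'] * 9 + ['gewerblich'],
    'abtest': ['test', 'control'] * 5,
    'price': ['$' + str(i) for i in range(10)],
})


def dominant_share(df):
    """Return, per column, the fraction of rows taken by the most frequent value."""
    shares = {}
    for column in df.columns:
        counts = df[column].value_counts(normalize=True, dropna=False)
        shares[column] = counts.iloc[0]  # value_counts sorts in descending order
    return pd.Series(shares).sort_values(ascending=False)


shares = dominant_share(toy)
print(shares)
```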

In [5]:
#Slide 3

# without include='all', describe() summarizes the numeric columns only
autos.describe(include = 'all') 


Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-21 20:37:19,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [6]:
# Slide 4:
# Data explorer function to view and find the value counts corresponding to each column.
def data_explorer(col_name):
    print('Looking at:', col_name)
    print(autos[col_name].head())
    print('-------------------')
    print(autos[col_name].value_counts())
    print('\n')

# Columns to explore, in order ("brand" is not in this list)
columns_to_explore = [
    "date_crawled", "name", "seller", "offer_type", "price", "abtest",
    "vehicle_type", "registration_year", "gearbox", "power_p_s", "model",
    "odometer", "registration_month", "fuel_type", "unrepaired_damage",
    "ad_created", "nr_of_pictures", "postal_code", "last_seen",
]

for col_name in columns_to_explore:
    data_explorer(col_name)

Looking at: date_crawled
0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
2    2016-03-26 18:57:24
3    2016-03-12 16:58:10
4    2016-04-01 14:38:50
Name: date_crawled, dtype: object
-------------------
2016-03-21 20:37:19    3
2016-03-25 19:57:10    3
2016-03-08 10:40:35    3
2016-03-16 21:50:53    3
2016-03-21 16:37:21    3
                      ..
2016-03-31 22:52:59    1
2016-03-10 23:55:06    1
2016-04-01 21:42:12    1
2016-04-04 03:03:23    1
2016-03-08 14:47:05    1
Name: date_crawled, Length: 48213, dtype: int64


Looking at: name
0                     Peugeot_807_160_NAVTECH_ON_BOARD
1           BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2                           Volkswagen_Golf_1.6_United
3    Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4    Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...
Name: name, dtype: object
-------------------
Ford_Fiesta                                                  78
Volkswagen_Golf_1.4                                          75
BMW_3



* `date_crawled`: string; many unique values.
* `name`: string; many unique values.
* `seller`: almost all values are identical (49,999 of 50,000 are `privat`); can be discarded.
* `offer_type`: almost all values are identical (49,999 of 50,000 are `Angebot`); can be discarded.
* `price`: numbers stored as strings; many unique values.
* `abtest`: only two unique values, split roughly 50-50; unclear whether it is needed.
* `vehicle_type`: string; required.
* `registration_year`: int64; many unique values, some of them odd and in need of cleaning.
* `gearbox`: only two unique values, split roughly 80-20 in favor of manual; unclear whether it is needed.
* `power_p_s`: int64; many unique values.
* `model`: string; many unique values.
* `odometer`: numbers stored as strings with a `km` suffix.
* `registration_month`: int64; month numbers out of 12.
* `fuel_type`: string; 7 unique types.
* `unrepaired_damage`: only two unique values; unclear whether to keep it.
* `ad_created`: dates stored as strings; many unique values.
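Notes like the ones above can be cross-checked with a small summary table listing each column's dtype, number of unique values, and share of nulls. This is only a sketch on a hypothetical toy dataframe; the helper name `column_summary` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.DataFrame({
    'price': ['$5,000', '$8,500', '$8,990'],
    'vehicle_type': ['bus', None, 'limousine'],
    'registration_year': [2004, 1997, 2009],
})


def column_summary(df):
    """Return dtype, unique-value count, and null percentage per column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'unique': df.nunique(),              # NaN is not counted as a value
        'null_pct': df.isnull().mean() * 100,
    })


summary = column_summary(toy)
print(summary)
```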

For `price` and `odometer`, we will:
* Remove any non-numeric characters.
* Convert the column to a numeric dtype.
* Use `DataFrame.rename()` to rename the `odometer` column to `odometer_km`.
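The steps above can be sketched on a few sample values. This variant strips every non-digit character with a single regular expression instead of removing `$`, `,` and `km` one by one; the toy dataframe is hypothetical, and integer output is an assumption that only holds when no values are missing.

```python
import pandas as pd

# Hypothetical sample rows in the same text format as the real columns.
toy = pd.DataFrame({
    'price': ['$5,000', '$8,500', '$0'],
    'odometer': ['150,000km', '70,000km', '5,000km'],
})

for column in ['price', 'odometer']:
    # \D matches any character that is not a digit.
    digits_only = toy[column].str.replace(r'\D', '', regex=True)
    toy[column] = digits_only.astype(int)

# Record the unit in the column name now that the "km" suffix is gone.
toy = toy.rename(columns={'odometer': 'odometer_km'})
print(toy)
```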

In [74]:

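# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
autos["price"] = autos["price"].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

autos['odometer'] = autos["odometer"].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(float)

autos.rename(columns = {'odometer':'odometer_km'}, inplace = True)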

We now explore `odometer_km` and `price` to find their unique, minimum, and maximum values.

In [75]:
## for odometer_km

print("unique values?")
print(autos["odometer_km"].unique().shape) #to see how many unique values
print('\n')
print("min/max/median/mean etc")    
print(autos["odometer_km"].describe()) #to view max and min and stats
print('\n')

print ('Looking at the first five of the value counts')
print(autos["odometer_km"].value_counts().head())
print('\n')

print ('Looking at the odometers highest and lowest value with their counts')
print(autos["odometer_km"].value_counts().sort_index(ascending = True))

unique values?
(13,)


min/max/median/mean etc
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


Looking at the first five of the value counts
150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
Name: odometer_km, dtype: int64


Looking at the odometers highest and lowest value with their counts
5000.0        967
10000.0       264
20000.0       784
30000.0       789
40000.0       819
50000.0      1027
60000.0      1164
70000.0      1230
80000.0      1436
90000.0      1757
100000.0     2169
125000.0     5170
150000.0    32424
Name: odometer_km, dtype: int64


There are 13 unique odometer values, and the majority of listings are concentrated at the 125,000 km and 150,000 km marks. We considered keeping only those values, but that is not feasible because we would lose a lot of potentially useful data, so we keep the full range of odometer values instead.

In [76]:
#For Price

print("unique values?")
print(autos["price"].unique().shape) #to see how many unique values
print('\n')
print("min/max/median/mean etc")    
print(autos["price"].describe()) #to view max and min and stats
print('\n')

print('Looking at the 100 most common values')
print(autos["price"].value_counts().head(100))
print('\n')

print('Looking at the price highest and lowest value with their counts')
print(autos["price"].value_counts().sort_index(ascending = True))

unique values?
(2357,)


min/max/median/mean etc
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


Looking at the 100 most common values
0.0        1421
500.0       781
1500.0      734
2500.0      643
1200.0      639
1000.0      639
600.0       531
800.0       498
3500.0      498
2000.0      460
999.0       434
750.0       433
900.0       420
650.0       419
850.0       410
700.0       395
4500.0      394
300.0       384
2200.0      382
950.0       379
1100.0      376
1300.0      371
3000.0      365
550.0       356
1800.0      355
5500.0      340
1250.0      335
350.0       335
1600.0      327
1999.0      322
           ... 
4999.0      174
2950.0      173
2750.0      169
6900.0      165
7000.0      165
8000.0      161
1499.0      157
1.0         156
7900.0      156
5200.0      156
3700.0      155
1550.0      150
990.0      

There are 2,357 unique price values in this column. The single most common price is 0 (1,421 listings), and the counts then decrease slowly across the other prices. To remove the outliers, we drop every listing priced above 110,000 dollars.
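A small sketch of range-based outlier removal with `Series.between`, which is inclusive at both ends by default. The sample prices are made up, and the lower bound of 1 (which would also drop listings priced at 0) is an assumption for illustration rather than the bound used in the notebook.

```python
import pandas as pd

# Hypothetical prices, including a zero and an absurdly high value.
prices = pd.Series([0.0, 500.0, 1200.0, 2950.0, 7200.0, 110000.0, 99999999.0])

lower, upper = 1, 110000  # assumed bounds, chosen for illustration only

mask = prices.between(lower, upper)  # True where lower <= price <= upper
kept = prices[mask]
dropped = prices[~mask]

print('kept:', len(kept), 'dropped:', len(dropped))
print(dropped)
```

Counting the dropped rows right after filtering is a cheap way to confirm that the filter removed roughly what was expected.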

In [77]:
##removing outliers for odometer and price 

autos = autos[autos["odometer_km"].between(999,150001)]

autos = autos[autos["price"].between(0,110000)]

autos.describe(include = 'all')


Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_p_s,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,49951,49951,49951,49951,49951.0,49951,44860,49951.0,47278,49951.0,47201,49951.0,49951.0,45477,49951,40130,49951,49951.0,49951.0,49951
unique,48166,38715,2,2,,2,8,,2,,245,,,7,40,2,76,,,39450
top,2016-03-05 16:57:05,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49950,49950,,25730,12854,,36969,,4024,,,30070,10684,35192,1944,,,8
mean,,,,,5599.885688,,,2005.075434,,116.166864,,125797.281336,5.72399,,,,,0.0,50807.675102,
std,,,,,7522.211811,,,105.763499,,209.143693,,39970.189911,3.712435,,,,,0.0,25777.770796,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1100.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30450.0,
50%,,,,,2900.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49565.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71522.0,


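We now have 49,951 rows, so 49 listings were removed as outliers. The much smaller standard deviation of `price` (about 7,500, down from about 481,000) confirms that the extreme prices are gone. The odometer filter keeps every value (the minimum is still 5,000 km), and listings with a price of 0 are still included.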

In [9]:
### Slide 5: Exploring the date columns

#for date_crawled
date_crawled_stats = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False)*100
print ('date_crawled_stats')
print(date_crawled_stats.sort_index())
print('\n')

#for ad_created
ad_created_stats = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)*100
print ('ad_created_stats')
print(ad_created_stats.sort_index())
print('\n')

#for last_seen
last_seen_stats = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)*100
print ('last_seen_stats')
print(last_seen_stats.sort_index())


date_crawled_stats
2016-03-05    2.538
2016-03-06    1.394
2016-03-07    3.596
2016-03-08    3.330
2016-03-09    3.322
2016-03-10    3.212
2016-03-11    3.248
2016-03-12    3.678
2016-03-13    1.556
2016-03-14    3.662
2016-03-15    3.398
2016-03-16    2.950
2016-03-17    3.152
2016-03-18    1.306
2016-03-19    3.490
2016-03-20    3.782
2016-03-21    3.752
2016-03-22    3.294
2016-03-23    3.238
2016-03-24    2.910
2016-03-25    3.174
2016-03-26    3.248
2016-03-27    3.104
2016-03-28    3.484
2016-03-29    3.418
2016-03-30    3.362
2016-03-31    3.192
2016-04-01    3.380
2016-04-02    3.540
2016-04-03    3.868
2016-04-04    3.652
2016-04-05    1.310
2016-04-06    0.318
2016-04-07    0.142
Name: date_crawled, dtype: float64


ad_created_stats
2015-06-11    0.002
2015-08-10    0.002
2015-09-09    0.002
2015-11-10    0.002
2015-12-05    0.002
              ...  
2016-04-03    3.892
2016-04-04    3.688
2016-04-05    1.184
2016-04-06    0.326
2016-04-07    0.128
Name: ad_created, Length: 7

* `date_crawled`: covers just over a month (2016-03-05 to 2016-04-07), with a fairly uniform distribution.
* `ad_created`: the earliest ads date back to June 2015, but the earlier months (including December and January) account for a very low percentage; the distribution is roughly uniform around March 2016.
* `last_seen`: covers just over a month (the same range as `date_crawled`), but is mainly concentrated on three days: 5, 6, and 7 April 2016.
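As an alternative to slicing the first ten characters of each string, the timestamps can be parsed with `pd.to_datetime`, which also makes it easy to measure the span of a column in days. This is a sketch on four timestamps taken from the first rows shown earlier.

```python
import pandas as pd

# Sample timestamps in the same format as the date columns.
stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
    '2016-03-12 16:58:10',
])

parsed = pd.to_datetime(stamps)

# Percentage of rows per calendar day, in chronological order.
days = parsed.dt.strftime('%Y-%m-%d')
share = days.value_counts(normalize=True).sort_index() * 100
print(share)

# Number of whole days between the earliest and latest timestamp.
span_days = (parsed.max() - parsed.min()).days
print(span_days)
```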

In [8]:
### Slide 6: Dealing with Incorrect Registration Year Data
autos['registration_year'].describe()

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [80]:
autos['registration_year'].value_counts().sort_index()

1000       1
1001       1
1111       1
1500       1
1800       2
1910       9
1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       1
1952       1
1953       1
1954       2
1955       2
1956       5
1957       2
1958       4
1959       7
1960      33
1961       6
1962       4
1963       9
        ... 
2001    2701
2002    2533
2003    2727
2004    2737
2005    3015
2006    2707
2007    2303
2008    2230
2009    2097
2010    1594
2011    1634
2012    1320
2013     802
2014     663
2015     395
2016    1310
2017    1452
2018     491
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, Length: 97, dtype: int64

I can see that the counts first reach about 40 per year around 1980, so I decided to use that as the approximate cutoff. There are earlier cars, but they are too few to be anything other than outliers.

Interestingly, 2017 has over 1,000 cars listed. This is obviously wrong, because the ads were crawled in March and April 2016. I will nevertheless keep the 2017 listings for now. Hence the expression below:

In [81]:
autos = autos[autos["registration_year"].between(1979,2017)]
autos["registration_year"].value_counts(normalize = True)*100

2000    6.851666
2005    6.160982
1999    6.126244
2004    5.592905
2003    5.572471
2006    5.531602
2001    5.519341
2002    5.176043
1998    5.006437
2007    4.706051
2008    4.556879
2009    4.285101
1997    4.142060
2011    3.338987
2010    3.257249
2017    2.967080
1996    2.950733
2012    2.697346
1995    2.678955
2016    2.676911
2013    1.638842
2014    1.354803
1994    1.348673
1993    0.909332
1990    0.807160
2015    0.807160
1992    0.794900
1991    0.727466
1989    0.369863
1988    0.290169
1985    0.212518
1980    0.198214
1986    0.155302
1987    0.153258
1983    0.108303
1984    0.108303
1982    0.087868
1979    0.071521
1981    0.061303
Name: registration_year, dtype: float64

We can see that the most common registration years are 2000, 2005, 1999, and 2004, after which the share tapers off.
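Before settling on a year range, it can help to measure how much data a candidate range would discard. The sketch below uses made-up years and an upper bound of 2016 (the crawl year), which is an assumption that differs from the 2017 bound used in the filter above.

```python
import pandas as pd

# Hypothetical registration years, including impossible values.
years = pd.Series([1000, 1979, 2003, 2016, 2017, 9999])

first_year, last_year = 1979, 2016  # assumed bounds; 2016 is the crawl year

valid = years.between(first_year, last_year)  # inclusive on both ends
share_invalid = (~valid).mean()               # fraction of rows outside the range

print(years[~valid].tolist())
print(share_invalid)
```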

### Slide 7. Exploring Price by Brand

In [82]:
print("The unique brand names are:")
autos["brand"].unique()

The unique brand names are:


array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'land_rover', 'alfa_romeo', 'lada', 'rover',
       'trabant', 'daihatsu', 'lancia'], dtype=object)

In [83]:
print('brand names sorted based on percentage of value counts')

print(autos["brand"].value_counts(normalize = True)*100) #%percentages
print('\n')
print('brand names sorted based on raw value counts')
autos["brand"].value_counts() #normal count


brand names sorted based on percentage of value counts
volkswagen        21.452071
bmw               10.954901
opel              10.930380
mercedes_benz      9.420275
audi               8.686679
ford               6.929317
renault            4.828657
peugeot            2.958906
fiat               2.580869
seat               1.867707
skoda              1.591843
mazda              1.534626
nissan             1.530539
smart              1.420193
citroen            1.395672
toyota             1.252631
hyundai            0.991070
volvo              0.907289
sonstige_autos     0.882768
mini               0.860290
mitsubishi         0.815334
honda              0.796943
kia                0.721336
alfa_romeo         0.651859
suzuki             0.590555
chevrolet          0.531295
porsche            0.514948
chrysler           0.363733
dacia              0.261561
daihatsu           0.255430
jeep               0.218648
subaru             0.218648
land_rover         0.196171
daewoo             0.

volkswagen        10498
bmw                5361
opel               5349
mercedes_benz      4610
audi               4251
ford               3391
renault            2363
peugeot            1448
fiat               1263
seat                914
skoda               779
mazda               751
nissan              749
smart               695
citroen             683
toyota              613
hyundai             485
volvo               444
sonstige_autos      432
mini                421
mitsubishi          399
honda               390
kia                 353
alfa_romeo          319
suzuki              289
chevrolet           260
porsche             252
chrysler            178
dacia               128
daihatsu            125
jeep                107
subaru              107
land_rover           96
daewoo               78
saab                 78
jaguar               73
rover                69
trabant              56
lancia               55
lada                 25
Name: brand, dtype: int64

In [84]:
autos["brand"].value_counts().index #brand names in descending order of count

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart',
       'citroen', 'toyota', 'hyundai', 'volvo', 'sonstige_autos', 'mini',
       'mitsubishi', 'honda', 'kia', 'alfa_romeo', 'suzuki', 'chevrolet',
       'porsche', 'chrysler', 'dacia', 'daihatsu', 'jeep', 'subaru',
       'land_rover', 'daewoo', 'saab', 'jaguar', 'rover', 'trabant', 'lancia',
       'lada'],
      dtype='object')

I would like to use the 10 most common brands.

In [85]:
mean_brand_price = {}

loop_array = autos["brand"].value_counts().index[0:10]

for b in loop_array:
    subset = autos[autos["brand"] == b]
    price_mean = subset["price"].mean()
    mean_brand_price [b]= price_mean

print("Mean prices for the 10 most popular brands are")
mean_brand_price

Mean prices for the 10 most popular brands are


{'audi': 8947.390731592566,
 'bmw': 7925.54933781011,
 'fiat': 2634.270783847981,
 'ford': 3343.857564140372,
 'mercedes_benz': 8169.280260303688,
 'opel': 2835.0994578425875,
 'peugeot': 3008.8232044198894,
 'renault': 2319.7731696995343,
 'seat': 4264.2045951859955,
 'volkswagen': 5145.990950657268}

We aggregated across brands to understand mean price. We observed that in the top 6 brands, there's a distinct price gap.

- Audi, BMW and Mercedes Benz are more expensive
- Ford and Opel are less expensive
- Volkswagen is in between
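The same aggregation can also be written without an explicit loop by using `groupby`. This is a sketch on a hypothetical toy dataframe, not a replacement for the loop used in the notebook; it keeps only the two most common brands to stay small.

```python
import pandas as pd

# Hypothetical listings: two brands with two rows each, one brand with one row.
toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'opel', 'opel', 'bmw'],
    'price': [9000.0, 8000.0, 3000.0, 2000.0, 7000.0],
})

# The most common brands, as in the loop-based version.
top_brands = toy['brand'].value_counts().index[:2]

# Mean price per brand, restricted to those brands.
mean_price = (
    toy[toy['brand'].isin(top_brands)]
    .groupby('brand')['price']
    .mean()
)
print(mean_price)
```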

For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price. While our natural instinct may be to display both aggregated series objects and visually compare them, this has a few limitations:

- it's difficult to compare more than two aggregate series objects if we want to extend to more columns
- we can't compare more than a few rows from each series object
- we can only sort by the index (brand name) of both series objects so we can easily make visual comparisons

Instead, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly.

By mean price, the three most expensive brands are Audi, Mercedes-Benz, and BMW; the next tier is Volkswagen, Seat, Ford, and Peugeot; and the cheapest are Opel, Fiat, and Renault.


**Attention**: first construct a series from the dictionary, then a dataframe from the series. Passing a dictionary of scalar values straight to `pd.DataFrame()` raises an error, so the series step can't be skipped here.

In [86]:
bmp_series = pd.Series(mean_brand_price)
print(bmp_series)

df = pd.DataFrame(bmp_series, columns=['mean_price'])
df

audi             8947.390732
bmw              7925.549338
fiat             2634.270784
ford             3343.857564
mercedes_benz    8169.280260
opel             2835.099458
peugeot          3008.823204
renault          2319.773170
seat             4264.204595
volkswagen       5145.990951
dtype: float64


Unnamed: 0,mean_price
audi,8947.390732
bmw,7925.549338
fiat,2634.270784
ford,3343.857564
mercedes_benz,8169.28026
opel,2835.099458
peugeot,3008.823204
renault,2319.77317
seat,4264.204595
volkswagen,5145.990951


In [87]:
## Aggregation loop for the mean mileage, reusing the same loop variable as above

mean_brand_milage = {}

for b in loop_array:
    subset = autos[autos["brand"] == b]
    milage_mean = subset["odometer_km"].mean()
    mean_brand_milage [b]= milage_mean

print("Mean milage for the 10 most popular brands are")
mean_brand_milage


Mean milage for the 10 most popular brands are


{'audi': 129597.74170783345,
 'bmw': 132739.22775601567,
 'fiat': 117414.88519398258,
 'ford': 124991.153052197,
 'mercedes_benz': 131427.33188720173,
 'opel': 129687.7921106749,
 'peugeot': 127365.33149171271,
 'renault': 128440.54168429962,
 'seat': 122155.36105032823,
 'volkswagen': 129164.60278148219}

In [88]:
## Building a series from the dictionary for mean mileage, then joining it to the dataframe

bmm_series = pd.Series(mean_brand_milage, name = "mean_milage")
print(bmm_series)

df = df.join(bmm_series)
df

audi             129597.741708
bmw              132739.227756
fiat             117414.885194
ford             124991.153052
mercedes_benz    131427.331887
opel             129687.792111
peugeot          127365.331492
renault          128440.541684
seat             122155.361050
volkswagen       129164.602781
Name: mean_milage, dtype: float64


Unnamed: 0,mean_price,mean_milage
audi,8947.390732,129597.741708
bmw,7925.549338,132739.227756
fiat,2634.270784,117414.885194
ford,3343.857564,124991.153052
mercedes_benz,8169.28026,131427.331887
opel,2835.099458,129687.792111
peugeot,3008.823204,127365.331492
renault,2319.77317,128440.541684
seat,4264.204595,122155.36105
volkswagen,5145.990951,129164.602781
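As a sketch of an alternative to building one dataframe and joining the second series onto it, both dictionaries can be wrapped in series and passed to `pd.DataFrame` together; pandas aligns them on the shared brand index. The values are three rows from the tables above, rounded to two decimals, and the variable names are hypothetical.

```python
import pandas as pd

# Three brands from the tables above, rounded to two decimals.
mean_price = {'audi': 8947.39, 'bmw': 7925.55, 'opel': 2835.10}
mean_km = {'audi': 129597.74, 'bmw': 132739.23, 'opel': 129687.79}

# A dict of scalars alone is rejected, which is why a Series is needed.
try:
    pd.DataFrame(mean_price)
    direct_construction_failed = False
except ValueError as error:
    direct_construction_failed = True
    print('direct construction failed:', error)

# Each Series becomes one column; rows are aligned on the brand index.
combined = pd.DataFrame({
    'mean_price': pd.Series(mean_price),
    'mean_km': pd.Series(mean_km),
})
print(combined)
```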


In [98]:
print(df.sort_values('mean_price', axis = 0, ascending=False))
print('\n')
print(df.sort_values('mean_milage', axis = 0, ascending=False))


                mean_price    mean_milage
audi           8947.390732  129597.741708
mercedes_benz  8169.280260  131427.331887
bmw            7925.549338  132739.227756
volkswagen     5145.990951  129164.602781
seat           4264.204595  122155.361050
ford           3343.857564  124991.153052
peugeot        3008.823204  127365.331492
opel           2835.099458  129687.792111
fiat           2634.270784  117414.885194
renault        2319.773170  128440.541684


                mean_price    mean_milage
bmw            7925.549338  132739.227756
mercedes_benz  8169.280260  131427.331887
opel           2835.099458  129687.792111
audi           8947.390732  129597.741708
volkswagen     5145.990951  129164.602781
renault        2319.773170  128440.541684
peugeot        3008.823204  127365.331492
ford           3343.857564  124991.153052
seat           4264.204595  122155.361050
fiat           2634.270784  117414.885194
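To go beyond comparing the two sorted tables by eye, one could compute the correlation between the two columns. The sketch below rebuilds the table from the values printed above, rounded to whole numbers. With only ten brand-level averages, the result is a rough indication at best.

```python
import pandas as pd

# Brand-level means from the table above, rounded to whole numbers.
summary = pd.DataFrame(
    {
        'mean_price': [8947, 7926, 2634, 3344, 8169, 2835, 3009, 2320, 4264, 5146],
        'mean_milage': [129598, 132739, 117415, 124991, 131427,
                        129688, 127365, 128441, 122155, 129165],
    },
    index=['audi', 'bmw', 'fiat', 'ford', 'mercedes_benz',
           'opel', 'peugeot', 'renault', 'seat', 'volkswagen'],
)

# Pearson correlation between mean price and mean mileage across brands.
correlation = summary['mean_price'].corr(summary['mean_milage'])
print(round(correlation, 2))
```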


### Conclusions

* Sorting by mean price, the top three are Audi, Mercedes-Benz, and BMW. Their mean mileages are really close, at around 130,000 km.

* The second tier is Volkswagen, Seat, and Ford, with mean mileages between about 122,000 km and 129,000 km.

* The third-tier brands (Peugeot, Opel, Fiat, and Renault) have mean mileages in a similar range to the second tier. This suggests that the top three brands retain their value best at high mileages.

* Sorting by mean mileage reveals that the highest mean mileage is for BMW, then Mercedes-Benz, and then Opel. Opel combines one of the lowest mean prices with one of the highest mean mileages.

Audi has the highest mean sale price, and BMW has the highest mean mileage.