# Exploring eBay Car Sales Data

This notebook will analyze used car sales listings from eBay Kleinanzeigen, a [classifieds](https://www.ebay-kleinanzeigen.de/s-autos/c216) section of the German eBay website.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). The version of the dataset we are working with is a sample of 50,000 data points prepared by [Dataquest](https://www.dataquest.io/), which was also modified to simulate a less-cleaned version of the data.

The data dictionary provided with the data is as follows:

- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has a damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeenOnline - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings.

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
autos = pd.read_csv("autos.csv", encoding="Latin-1")

In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Our dataset contains 20 columns, most of which are stored as strings. There are a few columns with null values, but no columns have more than ~20% null values. There are some columns that contain dates stored as strings.
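
The null percentages mentioned above can be read off directly rather than estimated from the `info()` counts. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `null_share` name are assumptions for illustration); on the real data the same expression would be applied to `autos`.

```python
import pandas as pd

# Toy frame standing in for `autos`; the values are invented for illustration.
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "suv", "limousine"],
    "price": ["$5,000", "$8,500", "$8,990", "$4,350", "$1,350"],
})

# isnull() yields a boolean frame; the mean of booleans is the fraction of nulls.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Sorting in descending order puts the columns with the most missing values at the top, which makes it easy to check a threshold such as 20%.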

# Clean Columns

In [5]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Changes to make:
- Change the column names from camelCase to snake_case.
- Change a few wordings to more accurately describe the columns.

In [6]:
rename_dict = {'yearOfRegistration': 'registration_year',
               'monthOfRegistration': 'registration_month',
               'notRepairedDamage': 'unrepaired_damage',
               'dateCreated': 'ad_created'}

In [7]:
def convert(name):
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
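
To see how the two substitutions cooperate, here is a small standalone sketch of the same two-step regex, applied to a few of the column names from this dataset. The function name `camel_to_snake` and the `examples` list are chosen for illustration only.

```python
import re

def camel_to_snake(name):
    # Step 1: insert an underscore before a capitalised word such as "Of" or "Crawled".
    s1 = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Step 2: split any remaining lowercase-or-digit to uppercase boundary ("rPS" -> "r_PS").
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s1).lower()

examples = ["dateCrawled", "powerPS", "nrOfPictures", "abtest"]
converted = {name: camel_to_snake(name) for name in examples}
print(converted)
```

The second step is what handles trailing acronyms like `PS`, which the first pattern skips because it requires lowercase letters after the capital.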

In [8]:
for c in autos.columns:
    # Keep the hand-written names above; convert everything else automatically.
    if c not in rename_dict:
        rename_dict[c] = convert(c)

In [9]:
rename_dict

{'abtest': 'abtest',
 'brand': 'brand',
 'dateCrawled': 'date_crawled',
 'dateCreated': 'ad_created',
 'fuelType': 'fuel_type',
 'gearbox': 'gearbox',
 'lastSeen': 'last_seen',
 'model': 'model',
 'monthOfRegistration': 'registration_month',
 'name': 'name',
 'notRepairedDamage': 'unrepaired_damage',
 'nrOfPictures': 'nr_of_pictures',
 'odometer': 'odometer',
 'offerType': 'offer_type',
 'postalCode': 'postal_code',
 'powerPS': 'power_ps',
 'price': 'price',
 'seller': 'seller',
 'vehicleType': 'vehicle_type',
 'yearOfRegistration': 'registration_year'}

In [10]:
autos = autos.rename(columns=rename_dict)

In [11]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


# Initial Data Exploration and Cleaning
Explore the data to find obvious areas where we can clean the data.

In [12]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-08 10:40:35,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Initial observations:

Columns where all (or nearly all) of the values are the same can be dropped:  
- seller  
- offer_type  


The nr_of_pictures column looks odd.

Numeric data stored as text:
- price  
- odometer  

Need more investigation, should be integers?  
- postal_code  
- registration_month  
- registration_year  
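
Columns dominated by a single value can also be found programmatically instead of by reading the `describe()` table. This is a rough sketch on invented toy data; the helper `top_value_share` and the 75% cut-off are assumptions made for the example, not part of the original analysis.

```python
import pandas as pd

# Toy frame; the rows are invented for illustration.
toy = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "privat"],
    "offer_type": ["Angebot", "Angebot", "Angebot", "Gesuch"],
    "brand": ["bmw", "ford", "opel", "bmw"],
})

def top_value_share(frame):
    # For each column, the fraction of rows taken up by its most common value.
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

shares = top_value_share(toy)
# Hypothetical cut-off: flag columns where one value covers at least 75% of rows.
near_constant = shares[shares >= 0.75].index.tolist()
print(near_constant)
```

On the real data a much stricter cut-off (for example 99%) would be more appropriate, since `seller` and `offer_type` each have a single value in 49,999 of 50,000 rows.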

In [13]:
autos["nr_of_pictures"].value_counts()

0    50000
Name: nr_of_pictures, dtype: int64

It looks like the nr_of_pictures column has 0 for every row. We'll drop this column, plus the other two we noted as mostly one value.

In [14]:
autos = autos.drop(["nr_of_pictures", "seller", "offer_type"], axis=1)


There are two columns, price and odometer, which hold numeric values stored as text with extra characters. We'll clean and convert these.

In [15]:
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)  # literal "$", not the regex end-of-string anchor
                  .str.replace(",", "", regex=False)
                  .astype(int)
                  )
autos["price"].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

In [16]:
autos["odometer"].unique()

array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km',
       '30,000km', '125,000km', '90,000km', '20,000km', '60,000km',
       '5,000km', '100,000km', '40,000km'], dtype=object)

In [17]:
autos["odometer"] = (autos["odometer"]
                             .str.replace("km","")
                             .str.replace(",","")
                             .astype(int)
                             )
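
Since both columns are cleaned the same way, the pattern could be wrapped in one helper. The sketch below is an alternative, not the approach used in this notebook: it strips every non-digit character with a regular expression. The helper name `digits_to_int` and the toy values are assumptions for illustration.

```python
import pandas as pd

def digits_to_int(series):
    # Drop every character that is not a digit, then cast to int.
    # regex=True is passed explicitly because the default differs across pandas versions.
    return series.str.replace(r"\D", "", regex=True).astype(int)

# Toy values in the same format as the raw price and odometer columns.
toy = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})
toy["price"] = digits_to_int(toy["price"])
toy["odometer"] = digits_to_int(toy["odometer"])
print(toy)
```

This shortcut would silently discard minus signs and decimal points, and the cast fails on missing values, so it only suits columns like these two that are fully populated with whole, non-negative numbers.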

In [18]:
# DataFrame.rename supports two calling conventions - https://pandas.pydata.org/pandas-docs/version/0.21.1/generated/pandas.DataFrame.rename.html
# autos.rename(mapper={"odometer": "odometer_km"}, index=1, inplace=True) #https://stackoverflow.com/questions/54912626/python-renaming-pandas-dataframe-columns
autos = autos.rename(columns={"odometer": "odometer_km"})

In [19]:
autos["odometer_km"].unique()

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])

In [20]:
autos["odometer_km"].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int64

# Exploring Odometer and Price

In [21]:
autos["odometer_km"].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

We can see that the values in this field are rounded, which might indicate that sellers had to choose from pre-set options for this field. Additionally, there are more high-mileage than low-mileage vehicles.
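
To put a number on the skew towards high mileage, the counts can be turned into shares. The sketch below re-enters the counts from the `value_counts()` output above; the 100,000 km threshold for "high mileage" is an arbitrary assumption for the example.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    150000: 32424, 125000: 5170, 100000: 2169, 90000: 1757, 80000: 1436,
    70000: 1230, 60000: 1164, 50000: 1027, 5000: 967, 40000: 819,
    30000: 789, 20000: 784, 10000: 264,
})

shares = counts / counts.sum()
# Share of listings at or above the (assumed) 100,000 km threshold.
high = shares[shares.index >= 100000].sum()
print(round(shares[150000], 3), round(high, 3))
```

On the DataFrame itself, `autos["odometer_km"].value_counts(normalize=True)` would give the same shares without retyping the counts.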

In [22]:
autos = autos.rename(columns = {"odometer":"odometer_km"})

In [23]:
autos["odometer_km"].unique().shape

(13,)

In [24]:
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [25]:
autos["odometer_km"].value_counts().head()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64

In [26]:
autos["odometer_km"].value_counts().sort_index(ascending=False).head()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64

Again, the prices in this column seem rounded; however, given that there are 2,357 unique values in the column, that may just reflect people's tendency to round prices on the site.

There are 1,421 cars listed with a $0 price. Given that this is under 3% of the cars, we will consider removing these rows. The maximum price is one hundred million dollars, which seems implausibly high, so let's look at the highest prices further.
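
The neighbouring cells inspect `odometer_km`; the same calls pointed at the price column would surface the extremes discussed here. The following is a minimal sketch on invented prices (the `toy_price` values are assumptions, not figures from the dataset).

```python
import pandas as pd

# Invented prices, including zero-priced and implausibly expensive listings.
toy_price = pd.Series([0, 0, 1, 500, 1200, 3000, 350000, 99999999], name="price")

summary = toy_price.describe()
# Sorting value counts by index shows the most extreme prices and how often they occur.
highest = toy_price.value_counts().sort_index(ascending=False).head(3)
lowest = toy_price.value_counts().sort_index(ascending=True).head(3)
# Fraction of listings priced at zero.
zero_share = (toy_price == 0).mean()
print(highest)
print(lowest)
print(zero_share)
```

On the real data this would be run as `autos["price"].value_counts().sort_index(...)`, with a larger `head()` to see where the prices stop looking realistic.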

In [27]:
autos["odometer_km"].value_counts().sort_index(ascending=False).head()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64

In [28]:
autos["odometer_km"].value_counts().sort_index(ascending=True).head()

5000     967
10000    264
20000    784
30000    789
40000    819
Name: odometer_km, dtype: int64

There are a number of listings with prices below \$30, including 1,421 at \$0. There are also a small number of listings with very high values, including 14 at around or over \$1 million.

Given that eBay is an auction site, there could legitimately be items where the opening bid is \$1. We will keep the \$1 items, but remove anything above \$350,000, since it seems that prices increase steadily to that number and then jump up to less realistic numbers.

In [29]:
autos = autos[autos["price"].between(1,30000000)]
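
Note that the cell above, as run, uses an upper bound of 30,000,000, so the outputs that follow (including the maximum of about 2.7e+07) still contain listings above \$350,000. A filter matching the \$350,000 cap described in the text might look like the sketch below; the toy prices and the `LOW`/`HIGH` names are assumptions for illustration.

```python
import pandas as pd

# Invented prices spanning the cases discussed in the text.
toy = pd.DataFrame({"price": [0, 1, 4350, 350000, 999990, 27322222]})

# Bounds taken from the text; Series.between is inclusive on both ends by default.
LOW, HIGH = 1, 350000
kept = toy[toy["price"].between(LOW, HIGH)]
print(kept)
```

Applying this tighter bound to `autos` would change the row count and every downstream statistic, so the outputs shown in this notebook would not be reproduced exactly.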

In [30]:
autos["price"].describe()

count    4.857800e+04
mean     8.069542e+03
std      1.800356e+05
min      1.000000e+00
25%      1.200000e+03
50%      3.000000e+03
75%      7.490000e+03
max      2.732222e+07
Name: price, dtype: float64

# Exploring the date columns

Columns with date information:

- date_crawled  
- registration_month  
- registration_year  
- ad_created  
- last_seen  

These are a combination of dates that were crawled, and dates with meta-information from the crawler. The non-registration dates are stored as strings.

In [31]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [32]:
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025320
2016-03-06    0.014039
2016-03-07    0.036004
2016-03-08    0.033328
2016-03-09    0.033101
2016-03-10    0.032175
2016-03-11    0.032566
2016-03-12    0.036930
2016-03-13    0.015666
2016-03-14    0.036539
2016-03-15    0.034275
2016-03-16    0.029602
2016-03-17    0.031640
2016-03-18    0.012907
2016-03-19    0.034769
2016-03-20    0.037877
2016-03-21    0.037404
2016-03-22    0.032998
2016-03-23    0.032216
2016-03-24    0.029334
2016-03-25    0.031599
2016-03-26    0.032196
2016-03-27    0.031084
2016-03-28    0.034851
2016-03-29    0.034131
2016-03-30    0.033678
2016-03-31    0.031846
2016-04-01    0.033678
2016-04-02    0.035469
2016-04-03    0.038598
2016-04-04    0.036519
2016-04-05    0.013092
2016-04-06    0.003170
2016-04-07    0.001400
Name: date_crawled, dtype: float64

It looks like the site was crawled daily over roughly a one-month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform.
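
Slicing the first ten characters works because the timestamps share a fixed format. An alternative, sketched here on three sample timestamps from the table above, is to parse the strings into real datetimes first; the variable names are chosen for illustration.

```python
import pandas as pd

# Three timestamps taken from the sample rows shown earlier.
toy = pd.DataFrame({"date_crawled": [
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
]})

# Parse with an explicit format, then truncate each timestamp to midnight.
crawled = pd.to_datetime(toy["date_crawled"], format="%Y-%m-%d %H:%M:%S")
daily_share = crawled.dt.normalize().value_counts(normalize=True).sort_index()
print(daily_share)
```

Parsed dates also make later arithmetic possible, such as computing the number of days between two columns, which string slices cannot do.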

In [33]:
autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001070
2016-03-06    0.004323
2016-03-07    0.005393
2016-03-08    0.007431
2016-03-09    0.009613
2016-03-10    0.010663
2016-03-11    0.012372
2016-03-12    0.023797
2016-03-13    0.008893
2016-03-14    0.012598
2016-03-15    0.015871
2016-03-16    0.016448
2016-03-17    0.028079
2016-03-18    0.007349
2016-03-19    0.015830
2016-03-20    0.020647
2016-03-21    0.020647
2016-03-22    0.021368
2016-03-23    0.018527
2016-03-24    0.019762
2016-03-25    0.019206
2016-03-26    0.016798
2016-03-27    0.015645
2016-03-28    0.020874
2016-03-29    0.022356
2016-03-30    0.024764
2016-03-31    0.023797
2016-04-01    0.022788
2016-04-02    0.024929
2016-04-03    0.025197
2016-04-04    0.024476
2016-04-05    0.124768
2016-04-06    0.221808
2016-04-07    0.131912
Name: last_seen, dtype: float64

The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.

The last three days contain a disproportionate number of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.
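
The size of this end-of-crawl spike can be checked with a little arithmetic. The sketch below re-enters the final seven daily shares from the output above and compares the last three days against the median of the four days before them; the choice of that baseline is an assumption made for the example.

```python
import pandas as pd

# Daily shares copied from the last_seen output above (final seven days of the crawl).
last_seen_share = pd.Series({
    "2016-04-01": 0.022788,
    "2016-04-02": 0.024929,
    "2016-04-03": 0.025197,
    "2016-04-04": 0.024476,
    "2016-04-05": 0.124768,
    "2016-04-06": 0.221808,
    "2016-04-07": 0.131912,
})

baseline = last_seen_share.iloc[:4].median()   # typical share just before the spike
final_three = last_seen_share.iloc[-3:]
ratio = final_three / baseline                 # how many times larger each final day is
tail_share = final_three.sum()                 # combined share of the last three days
print(ratio.round(1))
print(round(tail_share, 3))
```

Against this particular baseline the ratios come out at roughly 5x to 9x, in the same range as the estimate above, and the three days together account for nearly 48% of all 'last seen' values.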

In [34]:
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033163
2016-03-10    0.031887
2016-03-11    0.032896
2016-03-12    0.036766
2016-03-13    0.017004
2016-03-14    0.035181
2016-03-15    0.034007
2016-03-16    0.030117
2016-03-17    0.031290
2016-03-18    0.013586
2016-03-19    0.033678
2016-03-20    0.037939
2016-03-21 

There is a large variety of ad created dates. Most fall within 1-2 months of the crawl dates, but a few are quite old, with the oldest at around 9 months.
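
The age of each ad at crawl time can be computed explicitly by subtracting the two date columns. This is a minimal sketch on toy rows; the pairing of dates in the third row is invented to illustrate an old ad and does not correspond to a specific listing.

```python
import pandas as pd

# Toy rows; the third pairing is invented to show an old ad.
toy = pd.DataFrame({
    "date_crawled": ["2016-03-26 17:47:46", "2016-03-15 01:41:36", "2016-04-01 14:38:50"],
    "ad_created": ["2016-03-26 00:00:00", "2016-03-14 00:00:00", "2015-06-11 00:00:00"],
})

# Truncate the crawl timestamp to midnight so the difference is in whole days.
crawled = pd.to_datetime(toy["date_crawled"]).dt.normalize()
created = pd.to_datetime(toy["ad_created"])
age_days = (crawled - created).dt.days
print(age_days.tolist())
```

Calling `describe()` on the resulting column for the full dataset would show how many ads are more than a month or two old.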

In [35]:
autos["registration_year"].describe()

count    48578.000000
mean      2004.753119
std         88.632571
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The year that the car was first registered will likely indicate the age of the car. Looking at this column, we note some odd values. The minimum value is 1000, long before cars were invented, and the maximum is 9999, many years into the future.

# Dealing with Incorrect Registration Year Data
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

One option is to remove the listings with these values. Let's determine what percentage of our data has invalid values in this column:

In [36]:
autos[autos["registration_year"] < 1900]

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
10556,2016-04-01 06:02:10,UNFAL_Auto,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-01 00:00:00,63322,2016-04-01 09:42:30
22316,2016-03-29 16:56:41,VW_Kaefer.__Zwei_zum_Preis_von_einem.,1500,control,,1000,manuell,0,kaefer,5000,0,benzin,volkswagen,,2016-03-29 00:00:00,48324,2016-03-31 10:15:28
24511,2016-03-17 19:45:11,Trabant__wartburg__Ostalgie,490,control,,1111,,0,,5000,0,,trabant,,2016-03-17 00:00:00,16818,2016-04-07 07:17:29
32585,2016-04-02 16:56:39,UNFAL_Auto,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-02 00:00:00,63322,2016-04-04 14:46:21
49283,2016-03-15 18:38:53,Citroen_HY,7750,control,,1001,,0,andere,5000,0,,citroen,,2016-03-15 00:00:00,66706,2016-04-06 18:47:20


In [37]:
autos[autos["registration_year"] > 2016]

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
10,2016-03-15 01:41:36,VW_Golf_Tuning_in_siber/grau,999,test,,2017,manuell,90,,150000,4,benzin,volkswagen,nein,2016-03-14 00:00:00,86157,2016-04-07 03:16:21
55,2016-03-07 02:47:54,Mercedes_E320_AMG_zu_Tauschen!,1,test,,2017,automatik,224,e_klasse,125000,7,benzin,mercedes_benz,nein,2016-03-06 00:00:00,22111,2016-03-08 05:45:44
65,2016-04-04 19:30:39,Ford_Fiesta_zum_ausschlachten,250,control,,2017,manuell,65,fiesta,125000,9,benzin,ford,,2016-04-04 00:00:00,65606,2016-04-05 12:22:12
68,2016-04-03 17:36:59,Mini_cooper_s_clubman_/vollausstattung_/_Navi/...,10990,test,,2017,manuell,174,clubman,100000,0,,mini,nein,2016-04-03 00:00:00,83135,2016-04-05 17:26:26
84,2016-03-27 19:52:54,Renault_twingo,900,control,,2018,,60,twingo,150000,0,,renault,,2016-03-27 00:00:00,40589,2016-04-05 18:46:49
113,2016-04-03 14:58:29,Golf_4_Anfaenger_auto,1200,test,,2017,manuell,75,golf,150000,7,,volkswagen,,2016-04-03 00:00:00,97656,2016-04-05 14:15:48
164,2016-03-13 20:39:16,Opel_Meriva__nur_76000_Km__unfallfrei__scheckh...,4800,control,,2018,manuell,0,meriva,80000,4,benzin,opel,nein,2016-03-13 00:00:00,37627,2016-04-04 16:48:02
197,2016-04-05 10:36:24,VW_Polo_9N_an_Bastler,888,control,,2017,manuell,64,polo,20000,7,,volkswagen,ja,2016-04-05 00:00:00,58566,2016-04-07 13:16:13
253,2016-03-27 13:25:18,Ford_mondeo_Gas_anlage_mit_TÜV_04.2017,2250,test,,2017,manuell,0,mondeo,150000,8,benzin,ford,nein,2016-03-27 00:00:00,56575,2016-04-05 15:18:34
348,2016-03-17 20:58:24,VW_Beetle_1.8Turbo_mit_Vollausstattung_und_seh...,3750,control,,2017,manuell,150,beetle,150000,7,,volkswagen,nein,2016-03-17 00:00:00,45896,2016-03-24 17:17:50


In [38]:
(~autos["registration_year"].between(1900,2016)).sum() / autos.shape[0]

0.038824159084359176

Given that this is less than 4% of our data, we will remove these rows.

In [39]:
autos = autos[autos["registration_year"].between(1900,2016)]

In [40]:
autos["registration_year"].value_counts(normalize=True).head(10)

2000    0.067592
2005    0.062880
1999    0.062066
2004    0.057890
2003    0.057804
2006    0.057205
2001    0.056476
2002    0.053243
1998    0.050608
2007    0.048766
Name: registration_year, dtype: float64

It appears that most of the vehicles were first registered in the past 20 years.
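
The "past 20 years" observation can be expressed as a single share. The sketch below uses invented registration years; `CRAWL_YEAR` is set to 2016 because that is when the listings were crawled, and the variable names are chosen for illustration.

```python
import pandas as pd

# Invented registration years for illustration.
years = pd.Series([2004, 1997, 2009, 2007, 2003, 1985, 2016, 1999],
                  name="registration_year")

CRAWL_YEAR = 2016
# Fraction of cars first registered in the 20 years up to and including the crawl year.
recent_share = years.between(CRAWL_YEAR - 20, CRAWL_YEAR).mean()
print(recent_share)
```

On the real data the same expression would be applied to `autos["registration_year"]` after the invalid years have been removed.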

# Exploring the Price of SUVs

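First we filter the listings down to SUVs so we can get the names of the brands that offer them. Then we calculate the average price for each of those brands; note that the loop below averages over all of a brand's listings in `autos`, not only its SUVs.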

In [41]:
suvs = autos[autos["vehicle_type"] == 'suv']

In [42]:
brands = suvs["brand"].unique()

In [43]:
brand_data = {}

In [44]:
for b in brands:
    print(b)
    selected_rows = autos[autos["brand"]==b]
    avg_price = selected_rows["price"].mean()
    brand_data[b] = avg_price

mazda
jeep
mercedes_benz
sonstige_autos
volkswagen
toyota
chevrolet
bmw
kia
suzuki
hyundai
nissan
ford
dacia
land_rover
lada
skoda
porsche
renault
audi
honda
citroen
daihatsu
mitsubishi
opel
mini
volvo
peugeot
fiat
subaru
rover
saab
chrysler


In [45]:
brand_data

{'audi': 9336.687453600594,
 'bmw': 8571.480147917478,
 'chevrolet': 6684.139097744361,
 'chrysler': 3465.743902439024,
 'citroen': 45486.68549618321,
 'dacia': 5915.528455284553,
 'daihatsu': 1636.1965811965813,
 'fiat': 2813.748538011696,
 'ford': 7456.547932618683,
 'honda': 4107.857923497268,
 'hyundai': 5365.254273504273,
 'jeep': 11650.5,
 'kia': 5982.330303030303,
 'lada': 2688.296296296296,
 'land_rover': 19108.091836734693,
 'mazda': 4112.596614950635,
 'mercedes_benz': 8628.450366422385,
 'mini': 10613.459657701711,
 'mitsubishi': 3394.5729166666665,
 'nissan': 4743.40252454418,
 'opel': 5432.479195699781,
 'peugeot': 3094.0172290021537,
 'porsche': 45643.93706293706,
 'renault': 2474.8646069968195,
 'rover': 1602.2903225806451,
 'saab': 3211.6493506493507,
 'skoda': 6368.0,
 'sonstige_autos': 45208.36442516269,
 'subaru': 3953.29,
 'suzuki': 4096.732851985559,
 'toyota': 5167.091062394604,
 'volkswagen': 6729.81956411556,
 'volvo': 4946.501170960188}
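
The loop above could also be written with `groupby`, which makes it easy to compare the all-listings average per brand with an SUV-only average. The following is a rough sketch on invented toy data; the prices and the variable names are assumptions for illustration.

```python
import pandas as pd

# Invented listings for illustration.
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "jeep"],
    "vehicle_type": ["suv", "limousine", "suv", "kombi", "suv"],
    "price": [20000, 8000, 9000, 3000, 11000],
})

# Brands that have at least one SUV listing.
suv_brands = toy.loc[toy["vehicle_type"] == "suv", "brand"].unique()

# Average over all listings of those brands (what the loop above computes).
all_listings_mean = toy[toy["brand"].isin(suv_brands)].groupby("brand")["price"].mean()

# Average over SUV listings only.
suv_only_mean = toy[toy["vehicle_type"] == "suv"].groupby("brand")["price"].mean()

print(all_listings_mean)
print(suv_only_mean)
```

The two results differ whenever a brand sells other vehicle types as well, so which one to use depends on whether the question is about SUV prices or about the brands that happen to sell SUVs.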