# Guided Project: Exploring eBay Car Sales Data

## Introduction

In this project, we will work with a dataset of used cars from _eBay Kleinanzeigen_, a section of the German eBay website.

The aim of the project is to clean the data and analyze the included used car listings.

### Summary

The data cleaning and exploration done in this project are not exhaustive, and much more could be done with this dataset.

Below is a summary of the cleaning actions taken and the conclusions drawn from exploring the dataset.

The cleaning actions consisted of:
- renaming the columns to make them more readable and usable
- dropping uninformative columns such as `num_photos`, `seller` and `offer_type`
- converting the `price` and `odometer` columns to numeric types
- renaming the `odometer` column to include the unit
- removing the ads with an implausible price
- removing the ads with a registration year before 1900 or after the listing year

The exploration steps gave us the following insights:
- the data were crawled from the website daily over a period of roughly one month
- the distribution of ads crawled over this period is roughly uniform
- the ads were created over a period of roughly a year
- the dataset contains ads for cars from 40 different brands
- some brands are more represented than others
- among the most represented brands, the premium German brands Audi, Mercedes-Benz and BMW have higher average ad prices than the others (Volkswagen, Opel and Ford)
- the average mileage difference between these brands doesn't reflect this price difference
- the price difference may be explained by these brands' reputation for quality and longevity




## The Data

The dataset was originally scraped and uploaded to Kaggle by user orgesleka.

The original dataset isn't available on Kaggle anymore but can be found [here](https://data.world/data-society/used-cars-data).

Here, we will be working with a modified version of the original dataset provided by Dataquest:
- 50,000 data points were sampled from the full dataset, to ensure our code runs quickly in Dataquest's hosted environment
- the dataset was dirtied a bit to more closely resemble what we would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)

The data dictionary provided with the data is as follows:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle Type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - If the car has a damage which is not yet repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler saw this ad last online.

We start by importing the libraries and the dataset.

In [2]:
import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv',encoding='Windows-1252')
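
The `encoding` argument matters here because the listing names contain German characters such as Ü. As a minimal sketch (the helper `read_csv_with_fallback` and the in-memory sample are illustrative assumptions, not part of the original notebook), one way to find a working encoding is to try UTF-8 first and fall back to Windows-1252, which Python calls `cp1252`:

```python
import io

import pandas as pd

# Hypothetical one-row CSV, encoded the way the real file is assumed to be.
raw_bytes = 'name,price\nFord_Focus_TÜV_neu,"$1,350"\n'.encode("cp1252")


def read_csv_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (DataFrame, encoding) for the first encoding that decodes the bytes."""
    for encoding in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding failed, try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


sample, used_encoding = read_csv_with_fallback(raw_bytes)
print(used_encoding)
print(sample)
```

With a real file, the same loop would pass the file path to `pd.read_csv` instead of an in-memory buffer.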

Let's check the imported dataset!

In [3]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


In [4]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

Checking the first few rows of our dataset and calling the `info` method on it allows us to make some key observations:
- The dataset contains 50,000 entries and 20 columns, as expected
- Its columns are of type object (strings) or integers
- 5 columns out of 20 contain null values, but none of them has more than 20% null values
- The column names use camelCase instead of Python's preferred snake_case
- The `price` and `odometer` columns hold numeric values stored as strings, because each value includes its unit ($ and km)
- The `powerPS` column contains some 0 values, which doesn't make sense
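
The null-value observation can also be checked numerically rather than read off the `info` output. The following is a small sketch on a made-up sample (the values and the name `null_share` are illustrative assumptions); on the real data, the same two lines would be applied to `autos`:

```python
import pandas as pd

# Small made-up sample that mimics a few of the columns with missing values.
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "limousine", None],
    "gearbox": ["manuell", "automatik", None, "manuell", "manuell"],
    "brand": ["peugeot", "bmw", "ford", "audi", "seat"],
})

# isnull() gives a boolean frame; the column-wise mean of booleans is the null share.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)

# Columns whose null share exceeds a chosen threshold (20% here).
too_sparse = null_share[null_share > 0.20].index.tolist()
print(too_sparse)
```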

## Cleaning the dataset

### Cleaning the columns names

We start by converting the column names from camelCase to snake_case and rewording some of them, based on the data dictionary, to be more descriptive.

In [5]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [6]:
autos.columns = ['date_crawled','name','seller','offer_type','price','ab_test','vehicle_type','registration_year','gearbox','power_ps','model','odometer','registration_month','fuel_type','brand','unrepaired_damage','ad_created','num_photos','postal_code','last_seen']
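
The list above was written by hand, which also allowed some names to be reworded. For the purely mechanical part of the conversion, a regular expression can do the work. This is only a sketch; `camel_to_snake` is a hypothetical helper and is not used elsewhere in this notebook:

```python
import re


def camel_to_snake(name):
    """Insert an underscore at each lowercase/digit -> uppercase boundary, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


original_columns = ["dateCrawled", "offerType", "yearOfRegistration", "powerPS", "abtest"]
converted = [camel_to_snake(col) for col in original_columns]
print(converted)
# ['date_crawled', 'offer_type', 'year_of_registration', 'power_ps', 'abtest']
```

Note that the mechanical conversion yields `year_of_registration` and `abtest`, whereas the hand-written list uses the reworded `registration_year` and `ab_test`, so a manual pass is still needed for the more descriptive names.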

In [7]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Initial data exploration and cleaning

Now, let's do some basic data exploration to determine what other cleaning tasks need to be done.

Initially, we will look for:
- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
- Examples of numeric data stored as text which can be cleaned and converted.

Let's look at descriptive statistics for all columns:

In [8]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-30 17:37:35,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Looking at these descriptive statistics, we can make the following observations:

- The `seller` and `offer_type` columns contain only two unique values each. In the `seller` column, _privat_ occurs 49,999 times out of 50,000. The same holds for `offer_type`, where _Angebot_ occurs 49,999 times.
- As mentioned previously, the `price` and `odometer` columns contain numeric values stored as text: a number (price or distance) combined with its unit ($ and km).
- The `num_photos` column seems to be filled with 0s and needs to be investigated further.

Let's have a look at the `num_photos` column.

In [9]:
autos["num_photos"].value_counts()

0    50000
Name: num_photos, dtype: int64

It looks like the `num_photos` column is only filled with 0s as we expected. We then decide to drop it, together with the `seller` and `offer_type` columns.

In [41]:
autos.drop(["seller","offer_type","num_photos"],axis=1,inplace=True)
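
Near-constant columns like these three can also be detected programmatically. The sketch below runs on a made-up sample; the helper `dominant_share`, the 0.85 threshold and the second `seller` value are illustrative assumptions:

```python
import pandas as pd

# Made-up sample: two near-constant columns and one varied column.
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "num_photos": [0] * 10,
    "brand": ["bmw", "audi", "opel", "ford", "bmw", "seat", "audi", "fiat", "mini", "opel"],
})


def dominant_share(df):
    """Share of rows taken by the most frequent value in each column."""
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])


shares = dominant_share(sample)
near_constant = shares[shares > 0.85].index.tolist()
print(near_constant)

# drop() with columns= returns a new frame and leaves `sample` untouched.
trimmed = sample.drop(columns=near_constant)
print(trimmed.columns.tolist())
```

Returning a new frame instead of using `inplace=True` makes the cell safe to re-run, since the original frame is never modified.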

Next, we remove the units in the `price` and `odometer` columns and convert them to a numeric dtype.

In [10]:
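# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$','',regex=False).str.replace(',','',regex=False).astype(int)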

In [11]:
autos['price'].head()

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

In [12]:
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)

In [13]:
autos['odometer'].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64

The columns are now correctly formatted.
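
Both conversions follow the same pattern: strip a few literal tokens, then cast to an integer. As a sketch, this can be factored into one helper. `strip_to_int` is a hypothetical name, and the `regex=False` argument assumes a pandas version that supports it (0.23 or later):

```python
import pandas as pd


def strip_to_int(series, tokens):
    """Remove each literal token from a string Series, then cast to int."""
    cleaned = series
    for token in tokens:
        # regex=False makes '$' a plain character instead of a regex anchor
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)


# Values in the same format as the raw columns shown earlier.
prices = pd.Series(["$5,000", "$8,500", "$300"])
odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

price_int = strip_to_int(prices, ["$", ","])
odometer_int = strip_to_int(odometer, ["km", ","])
print(price_int.tolist())     # [5000, 8500, 300]
print(odometer_int.tolist())  # [150000, 70000, 5000]
```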

Finally, we rename the `odometer` column to `odometer_km` for the sake of clarity.

In [14]:
autos.rename({'odometer':'odometer_km'},axis=1,inplace=True)

In [15]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen'],
      dtype='object')

### Exploring the price and odometer columns

Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the `odometer_km` and `price` columns.

We print the descriptive statistics table again for these two columns only.

In [16]:
autos[['odometer_km','price']].describe()

| | odometer_km | price |
|---|---|---|
| count | 50000.0 | 50000.0 |
| mean | 125732.7 | 9840.044 |
| std | 40042.211706 | 481104.4 |
| min | 5000.0 | 0.0 |
| 25% | 125000.0 | 1100.0 |
| 50% | 150000.0 | 2950.0 |
| 75% | 150000.0 | 7200.0 |
| max | 150000.0 | 100000000.0 |


The minimum and maximum values of the `odometer_km` column seem realistic.

Concerning the `price` column, the minimum value is \$0 and the maximum is about \$100,000,000. Both values seem unrealistic and we need to explore this further.

Let's check the number of unique values for each column.

In [17]:
autos['odometer_km'].unique().shape

(13,)

The `odometer_km` column features only 13 unique values, an unexpectedly low number. These mileage values look rounded.
Sellers probably had to choose from a set of preset mileage options on the ad creation page.

Let's look at the counts for each of these 13 unique values.

In [42]:
autos['odometer_km'].value_counts().sort_index(ascending=False)

150000    30085
125000     4857
100000     2058
90000      1673
80000      1375
70000      1187
60000      1128
50000       993
40000       797
30000       760
20000       742
10000       241
5000        785
Name: odometer_km, dtype: int64

We can see that there are more high-mileage vehicles than low-mileage ones.
The data seem fine and we decide to leave this column as is.

Next, we look more closely at the `price` column.

In [19]:
print(autos['price'].unique().shape)

(2357,)


The `price` column features 2357 unique values. Let's have a look at the counts for those values.

In [20]:
autos['price'].value_counts().sort_index(ascending=True).head(20)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

In [21]:
autos['price'].value_counts().sort_index(ascending=False).head(30)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
194000      1
190000      1
180000      1
175000      1
169999      1
169000      1
163991      1
163500      1
155000      1
151990      1
Name: price, dtype: int64

There are 1,421 cars with a price of \$0, which is only about 3% of the dataset, so we can remove those data points.
There are also quite a few ads with prices below the \$30 mark.
The most expensive car is listed at \$99,999,999 and several more are priced above \$500,000.

Given that _eBay_ is an auction site, there may well be ads with opening bids starting at \$1.
So although we remove the ads with a \$0 price, we keep the other low-priced ones.
At the high end, there is a jump from \$350,000 to \$999,990, and the prices above that jump look implausible (for example 11,111,111 or 12,345,678).
We decide to remove the ads with prices of \$999,990 and above.

In [22]:
autos = autos[autos['price'].between(1,350000)]
autos['price'].describe()

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64
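
To make the effect of the filter explicit, the boolean mask can be kept in a variable so that both the retained and the discarded rows can be inspected. A minimal sketch on made-up prices (the values and variable names are illustrative assumptions):

```python
import pandas as pd

# Made-up prices covering both tails seen in the value counts above.
prices = pd.Series([0, 1, 250, 5000, 350000, 999990, 99999999])

# between() includes both bounds by default, so 1 and 350000 are kept.
keep = prices.between(1, 350000)

kept = prices[keep]
removed = prices[~keep]
print(kept.tolist())     # [1, 250, 5000, 350000]
print(removed.tolist())  # [0, 999990, 99999999]
```

Inspecting `removed` before discarding it is a cheap way to confirm that the chosen bounds only exclude the intended rows.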

### Exploring the date columns

We continue our data cleaning task by looking more in depth at the date columns.

There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself.

- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

Right now, pandas identifies the `date_crawled`, `last_seen` and `ad_created` columns as strings. To understand their distribution, we need to extract the date part of each value and count how often each date occurs. The other two columns are numeric, so we can use methods like `Series.describe()` to understand their distribution without any extra data processing.

Let's print the first 5 rows of the string formatted columns.

In [44]:
autos[['date_crawled','ad_created','last_seen']].head(5)

| | date_crawled | ad_created | last_seen |
|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |


We want to know the date range of the crawl.
We extract just the date part of the `date_crawled` values and print the earliest and latest dates in this series.


In [47]:
autos['date_crawled'].str[:10].sort_values(ascending=True).head(1)

11986    2016-03-05
Name: date_crawled, dtype: object

In [25]:
autos['date_crawled'].str[:10].sort_values(ascending=False).head(1)

47885    2016-04-07
Name: date_crawled, dtype: object

The data were crawled from the 5th of March 2016 to the 7th of April of that same year.
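
Slicing the first 10 characters works because the timestamps share a fixed format. An alternative, sketched below on three timestamps copied from the rows shown earlier, is to parse the strings with `pd.to_datetime`, which gives real datetime values that can be compared and grouped. The variable names are illustrative assumptions, and this is not the approach used in the rest of the notebook:

```python
import pandas as pd

# Three timestamps copied from the rows shown above.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-12 16:58:10",
])

parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")

# Earliest and latest calendar day in the sample.
first_day = parsed.min().date()
last_day = parsed.max().date()
print(first_day, last_day)

# Share of rows per calendar day, in chronological order.
per_day = parsed.dt.date.value_counts(normalize=True).sort_index()
print(per_day)
```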

Let's look at how the crawl dates are distributed over that period.

In [26]:
autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=False)

2016-04-07    0.001400
2016-04-06    0.003171
2016-04-05    0.013096
2016-04-04    0.036487
2016-04-03    0.038608
2016-04-02    0.035478
2016-04-01    0.033687
2016-03-31    0.031834
2016-03-30    0.033687
2016-03-29    0.034099
2016-03-28    0.034860
2016-03-27    0.031092
2016-03-26    0.032204
2016-03-25    0.031607
2016-03-24    0.029342
2016-03-23    0.032225
2016-03-22    0.032987
2016-03-21    0.037373
2016-03-20    0.037887
2016-03-19    0.034778
2016-03-18    0.012911
2016-03-17    0.031628
2016-03-16    0.029610
2016-03-15    0.034284
2016-03-14    0.036549
2016-03-13    0.015670
2016-03-12    0.036920
2016-03-11    0.032575
2016-03-10    0.032184
2016-03-09    0.033090
2016-03-08    0.033296
2016-03-07    0.036014
2016-03-06    0.014043
2016-03-05    0.025327
Name: date_crawled, dtype: float64

In [27]:
autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).describe()

count    34.000000
mean      0.029412
std       0.009762
min       0.001400
25%       0.029980
50%       0.032781
75%       0.034840
max       0.038608
Name: date_crawled, dtype: float64

It looks like the site was crawled daily over roughly a one month period. Moreover, we can observe that the distribution of the listings crawled on each day is roughly uniform. No particular cleaning action is required for this column.

Next, we look at the `ad_created` column. We want to know the time period the ads were created over.

In [61]:
autos['ad_created'].str[:7].sort_values(ascending=True).head(1)

22781    2015-06
Name: ad_created, dtype: object

In [62]:
autos['ad_created'].str[:7].sort_values(ascending=False).head(1)

19651    2016-04
Name: ad_created, dtype: object

The ads were created over a period of roughly one year, between June 2015 and April 2016. This column doesn't contain any odd dates far in the future or in the past.

We then look at the distribution of ad creation dates.

In [63]:
autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False)

2016-04-03    0.039009
2016-03-20    0.038067
2016-03-21    0.037531
2016-04-04    0.036953
2016-03-12    0.036653
2016-04-02    0.035196
2016-03-07    0.035004
2016-03-14    0.035004
2016-03-28    0.034725
2016-03-15    0.034168
2016-03-29    0.034082
2016-04-01    0.033825
2016-03-19    0.033611
2016-03-30    0.033611
2016-03-08    0.033568
2016-03-09    0.033290
2016-03-11    0.032754
2016-03-22    0.032583
2016-03-26    0.032090
2016-03-23    0.032090
2016-03-10    0.031962
2016-03-31    0.031876
2016-03-25    0.031683
2016-03-17    0.031469
2016-03-27    0.030612
2016-03-16    0.029927
2016-03-24    0.029434
2016-03-05    0.022793
2016-03-13    0.017180
2016-03-06    0.015381
                ...   
2016-02-21    0.000064
2016-02-05    0.000043
2016-02-12    0.000043
2016-02-02    0.000043
2016-01-27    0.000043
2016-02-24    0.000043
2016-02-20    0.000043
2016-02-26    0.000043
2016-02-18    0.000043
2016-01-10    0.000043
2016-02-14    0.000043
2016-01-03    0.000021
2016-01-07 

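The creation dates span many different days. Most ads were created during the crawling period (March and April 2016), with a small share of older ads created in the preceding months. The distribution looks fine and no particular cleaning action is required for this column.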

To continue, we have a further look at the `last_seen` column.

The crawler recorded the date it last saw each listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.

Let's have a look at the distribution of 'last seen' dates.

In [65]:
autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=False)

2016-04-07    0.132752
2016-04-06    0.223324
2016-04-05    0.125404
2016-04-04    0.024121
2016-04-03    0.025149
2016-04-02    0.024657
2016-04-01    0.022943
2016-03-31    0.023628
2016-03-30    0.024614
2016-03-29    0.022086
2016-03-28    0.020694
2016-03-27    0.015638
2016-03-26    0.016795
2016-03-25    0.018937
2016-03-24    0.019687
2016-03-23    0.018359
2016-03-22    0.020844
2016-03-21    0.020587
2016-03-20    0.020629
2016-03-19    0.015617
2016-03-18    0.007219
2016-03-17    0.028084
2016-03-16    0.016281
2016-03-15    0.016002
2016-03-14    0.012660
2016-03-13    0.008654
2016-03-12    0.023757
2016-03-11    0.012382
2016-03-10    0.010690
2016-03-09    0.009768
2016-03-08    0.007476
2016-03-07    0.005377
2016-03-06    0.004113
2016-03-05    0.001071
Name: last_seen, dtype: float64

The last three days contain a disproportionate share of 'last seen' values. Given that these shares are 6-10x those of the previous days, a massive spike in sales is unlikely. More likely, these values reflect the end of the crawling period and don't indicate car sales.

Next, we want to investigate the `registration_year` column.

In [31]:
autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The year that the car was first registered will likely indicate the age of the car. Looking at this column, we note some odd values. The minimum value is 1000, long before cars were invented and the maximum is 9999, many years into the future.

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's check the share of cars with a registration year within the 1900-2016 range.

In [32]:
round(autos["registration_year"].between(1900,2016).sum()/len(autos),2)

0.96

Only 4% of the ads feature a registration year outside of the 1900-2016 range. We thus decide to get rid of those data points.

In [33]:
autos = autos[autos["registration_year"].between(1900,2016)]
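
The share computed above divides the number of `True` values by the number of rows, which is the same as taking the mean of the boolean mask. A short sketch on made-up years (the values and variable names are illustrative assumptions) shows the equivalence and the resulting filter:

```python
import pandas as pd

# Made-up registration years, including the two extremes reported by describe().
years = pd.Series([1000, 1997, 2004, 2009, 2016, 2017, 9999, 2003])

in_range = years.between(1900, 2016)

# The mean of a boolean Series is the fraction of True values.
share_in_range = in_range.mean()
print(share_in_range)  # 0.625

# Same result as the sum / len formulation used in the cell above.
assert share_in_range == in_range.sum() / len(years)

valid_years = years[in_range]
print(valid_years.tolist())  # [1997, 2004, 2009, 2016, 2003]
```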

Next, we calculate the distribution of registration years among the remaining ads.

In [34]:
autos["registration_year"].value_counts(normalize=True).sort_values(ascending=False).head(10)

2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
2006    0.057197
2001    0.056468
2002    0.053255
1998    0.050620
2007    0.048778
Name: registration_year, dtype: float64

It appears that most of the vehicles were registered in the last 20 years.

### Exploring the brands column

Let's now explore the `brand` column.

In [35]:
autos["brand"].unique()

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [67]:
autos["brand"].unique().shape

(40,)

The dataset contains cars of 40 different brands. 

We calculate the percentages of ads associated with each brand.

In [36]:
autos["brand"].value_counts(normalize=True)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

Not all brands are equally represented in the dataset. We pursue our analysis with the brands that account for the largest share of ads (more than 5% of the dataset).

In [37]:
brands_count = autos["brand"].value_counts(normalize=True)
selected_brands = brands_count[brands_count > 0.05].index
print(selected_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')


That leaves us with 6 brands.

We calculate the mean price of ads for each of those brands.

In [38]:
brands_mean_prices = {}

for b in selected_brands:
    mean_price = autos[autos['brand']==b]['price'].mean()
    brands_mean_prices[b] = int(mean_price)
    
print(brands_mean_prices)

{'audi': 9336, 'volkswagen': 5402, 'ford': 3749, 'bmw': 8332, 'opel': 2975, 'mercedes_benz': 8628}


We can notice a gap in mean price among the most listed brands.

- Audi, Mercedes-Benz and BMW have the highest mean prices. These German brands are renowned for their quality.
- Opel and Ford are less expensive.
- Volkswagen is in between. It may offer a good quality-to-price ratio.

For these top 6 brands, let's now aggregate on the average mileage of the cars to see if there's any link with the calculated mean price.

In [39]:
brands_mean_mileages = {}

for b in selected_brands:
    mean_mileage = autos[autos['brand']==b]['odometer_km'].mean()
    brands_mean_mileages[b] = int(mean_mileage)

print(brands_mean_mileages)

{'audi': 129157, 'volkswagen': 128707, 'ford': 124266, 'bmw': 132572, 'opel': 129310, 'mercedes_benz': 130788}


To better compare each brand's mean price and mileage, we convert both dictionaries to Series objects and combine them into a single DataFrame with a shared index.

In [40]:
bmp_series = pd.Series(brands_mean_prices)
bmm_series = pd.Series(brands_mean_mileages)

df = pd.DataFrame(bmp_series, columns = ['mean_price'])
df['mean_mileage'] = bmm_series

df.sort_values(by='mean_price', ascending=False)

| | mean_price | mean_mileage |
|---|---|---|
| audi | 9336 | 129157 |
| mercedes_benz | 8628 | 130788 |
| bmw | 8332 | 132572 |
| volkswagen | 5402 | 128707 |
| ford | 3749 | 124266 |
| opel | 2975 | 129310 |
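
The two loops and the manual assembly of the DataFrame can be replaced by a single `groupby` aggregation. The sketch below uses a small made-up sample and a 20% share threshold (instead of the 5% used on the real data) so that the selection step remains visible; all values and variable names are illustrative assumptions:

```python
import pandas as pd

# Made-up listings; the real analysis would use the cleaned `autos` frame.
listings = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "ford", "trabant"],
    "price": [9000, 10000, 2000, 4000, 3500, 1500],
    "odometer_km": [125000, 150000, 150000, 100000, 125000, 60000],
})

# Keep only brands above the share threshold.
brand_share = listings["brand"].value_counts(normalize=True)
selected = brand_share[brand_share > 0.2].index

# One pass computes both means per brand.
summary = (
    listings[listings["brand"].isin(selected)]
    .groupby("brand")[["price", "odometer_km"]]
    .mean()
    .rename(columns={"price": "mean_price", "odometer_km": "mean_mileage"})
    .sort_values("mean_price", ascending=False)
)
print(summary)
```

Unlike the loop version, this keeps the means as floats; they can be rounded or cast afterwards if integer values are preferred.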


Mean mileage varies much less by brand than mean price does: all six brands fall within 10% of one another. There is a slight trend for the more expensive brands to have higher mileage and the less expensive ones lower mileage.
This seems counter-intuitive at first, as the value of a car usually decreases as its mileage increases.
Audi, Mercedes-Benz and BMW have a reputation for making reliable and long-lasting cars. That could explain why cars from these brands sell for higher prices than the ones from Volkswagen, Ford or Opel at equivalent mileage.
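
The two claims about spread and trend can be quantified from the six rows of the comparison table. The sketch below copies those values into a new frame; the variable names are illustrative assumptions, and with only six data points the correlation is a rough indication rather than evidence of a relationship:

```python
import pandas as pd

# Values copied from the mean price / mean mileage table above.
summary = pd.DataFrame(
    {
        "mean_price": [9336, 8628, 8332, 5402, 3749, 2975],
        "mean_mileage": [129157, 130788, 132572, 128707, 124266, 129310],
    },
    index=["audi", "mercedes_benz", "bmw", "volkswagen", "ford", "opel"],
)

# Relative spread of each column: (max - min) / min.
spread = (summary.max() - summary.min()) / summary.min()
print(spread.round(3))

# Pearson correlation between mean price and mean mileage (Series.corr default).
correlation = summary["mean_price"].corr(summary["mean_mileage"])
print(round(correlation, 2))
```

The spread of the mileage column stays below 10%, while the most expensive brand costs roughly three times as much as the cheapest one, which matches the reading of the table given above.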

## Conclusion

In this guided project, we explored the German eBay car ads dataset to identify the necessary cleaning actions and gain some insight into the data.

The data cleaning and exploration done in this project are not exhaustive, and much more could be done with this dataset.

Below is a summary of the cleaning actions taken and the conclusions drawn from exploring the dataset.

The cleaning actions consisted of:
- renaming the columns to make them more readable and usable
- dropping uninformative columns such as `num_photos`, `seller` and `offer_type`
- converting the `price` and `odometer` columns to numeric types
- renaming the `odometer` column to include the unit
- removing the ads with an implausible price
- removing the ads with a registration year before 1900 or after the listing year

The exploration steps gave us the following insights:
- the data were crawled from the website daily over a period of roughly one month
- the distribution of ads crawled over this period is roughly uniform
- the ads were created over a period of roughly a year
- the dataset contains ads for cars from 40 different brands
- some brands are more represented than others
- among the most represented brands, the premium German brands Audi, Mercedes-Benz and BMW have higher average ad prices than the others (Volkswagen, Opel and Ford)
- the average mileage difference between these brands doesn't reflect this price difference
- the price difference may be explained by these brands' reputation for quality and longevity