# INTRODUCTION

In this project, we use a dataset of [used cars from eBay Kleinanzeigen](https://data.world/data-society/used-cars-data), the classifieds section of the German eBay website.

### Table of Contents <a class="anchor" id="s0"></a>

* [Dataset and aim of the study](#s1)
* [Data cleaning and exploration](#s2)
* [Odometer investigation](#s3)
* [Registration year investigation](#s4)
* [Exploring Price by Brand](#s5)
* [Correlation between price and mileage (odometer_km)?](#s6)
* [Further data cleaning](#s7)
* [Most common brand/model combinations](#s8)
* [How much cheaper are cars with damage than their non-damaged counterparts?](#s9)
    

## Dataset and aim of the study <a class="anchor" id="s1"></a>

<span style='background :yellow' > A few modifications were made to the original dataset:</span>
- Sampled down to 50,000 data points to ensure the code runs quickly
- Modified to bring it closer to a real scraped dataset

---

The data dictionary provided with the dataset is as follows:

| Column name | Description |
| :--- | :--- |
|dateCrawled| When this ad was first crawled. All field-values are taken from this date.|
|seller| Whether the seller is private or a dealer.|
|offerType | The type of listing.| 
|price| The price on the ad to sell the car.|
|abtest| Whether the listing is included in an A/B test.|
|vehicleType| The vehicle Type.|
|yearOfRegistration| The year in which the car was first registered.|
|gearbox| The transmission type.|
|powerPS| The power of the car in PS.|
|model| The car model name.|
|kilometer| How many kilometers the car has driven.|
|monthOfRegistration| The month in which the car was first registered.|
|fuelType| What type of fuel the car uses.|
|brand| The brand of the car.|
|notRepairedDamage| If the car has a damage which is not yet repaired.|
|dateCreated| The date on which the eBay listing was created.|
|nrOfPictures| The number of pictures in the ad.|
|postalCode| The postal code for the location of the vehicle.|
|lastSeenOnline| When the crawler saw this ad last online.|

---

### <span style='color:Blue'> The aim of this project is to clean the data and analyze the included used car listings.</span>

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding = 'Latin-1')

In [2]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

We can see that several columns are stored with the object dtype even though they hold numeric or date values, such as:
- dateCrawled
- price                
- odometer             
- dateCreated          
- lastSeen             

# Data cleaning and exploration <a class="anchor" id="s2"></a>

The `RangeIndex` reported for the index axis indicates **50,000** entries.

The summary lists all columns with their data types, most of which are strings, along with the number of non-null values in each column.

The following columns, all holding string values, contain nulls:
- vehicleType 
- gearbox
- model
- fuelType
- notRepairedDamage

Note that the column names use camelCase instead of Python's preferred snake_case.


In [4]:
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

autos.columns = new_columns

print(autos.columns)

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')


The column names were converted from camelCase to snake_case, and some were reworded based on the data dictionary to be more descriptive (`powerPS` was left unchanged).
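
The mechanical part of this renaming can also be automated. The sketch below is only an illustration: `camel_to_snake` is a hypothetical helper, not part of the notebook, and it cannot produce the reworded names (for example `yearOfRegistration` to `registration_year`), which still need a manual mapping.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case (hypothetical helper)."""
    # Insert an underscore at every position where an uppercase letter
    # follows a lowercase letter or a digit, then lowercase the result.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

for original in ["dateCrawled", "vehicleType", "nrOfPictures", "powerPS"]:
    print(original, "->", camel_to_snake(original))
```

Assuming a DataFrame named `autos`, the helper could be applied to every column with `autos.columns = [camel_to_snake(c) for c in autos.columns]`.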

In [5]:
autos.gearbox.value_counts()

manuell      36993
automatik    10327
Name: gearbox, dtype: int64

Looking at the data, we can see that <span style='background:pink'>price</span> and <span style='background:pink'>odometer</span> columns are numeric values stored as text. 

For each column:
- Any non-numeric character will be removed;
- The column will be converted to a numeric dtype.
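
The two steps above can be sketched on made-up values as follows. This variant strips every non-digit character in one pass with a regular expression, which assumes the values are whole, non-negative numbers (a decimal point or minus sign would be removed too). The `raw` Series is hypothetical.

```python
import pandas as pd

# Made-up values in the same format as the raw price and odometer columns.
raw = pd.Series(["$5,000", "$8,500", "150,000km"])

# Step 1: remove every character that is not a digit.
digits_only = raw.str.replace(r"\D", "", regex=True)

# Step 2: convert the cleaned strings to a numeric dtype.
cleaned = digits_only.astype(int)
print(cleaned.tolist())
```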

In [6]:
# '$' is a regex anchor, so match it literally with regex=False
autos["price"] = autos["price"].str.replace('$', '', regex=False)
autos["price"] = autos["price"].str.replace(',', '', regex=False)
autos["price"] = autos["price"].astype(int)
autos["price"].head()



0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int32

In [7]:
autos["odometer"] = autos["odometer"].str.replace('km','')
autos["odometer"] = autos["odometer"].str.replace(',','')
autos["odometer"] = autos["odometer"].astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)

In [8]:
autos["odometer_km"].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int32

Now, let's check these columns for values that look unrealistically high or low (outliers) and might be worth removing.
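
One rough way to spot such values before inspecting individual rows is to compare the mean with the median: a handful of extreme entries pulls the mean far away from the median. The sketch below uses made-up prices; the `prices` Series is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Made-up prices with one implausible entry at the end.
prices = pd.Series([1100, 2950, 4350, 7200, 99999999])

summary = prices.describe()

# The median (50%) stays close to typical prices, while the mean
# is dragged upward by the single extreme value.
print(summary[["mean", "50%", "max"]])
```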

## Odometer investigation <a class="anchor" id="s3"></a>
---

In [9]:
# how many unique values?
autos["odometer_km"].unique().shape

(13,)

In [10]:
# Statistics
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [11]:
# Value count in ascending order
autos["odometer_km"].value_counts().sort_index(ascending=True)

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

In [12]:
autos[(autos["odometer_km"]==150000)].sort_index(ascending=False).head(5)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
49999,2016-03-14 00:42:12,Opel_Vectra_1.6_16V,privat,Angebot,1250,control,limousine,1996,manuell,101,vectra,150000,1,benzin,opel,nein,2016-03-13 00:00:00,0,45897,2016-04-06 21:18:48
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,1980,control,cabrio,1996,manuell,75,astra,150000,5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49994,2016-03-22 17:36:42,Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc...,privat,Angebot,5000,control,kombi,2001,automatik,299,a6,150000,1,benzin,audi,nein,2016-03-22 00:00:00,0,46537,2016-04-06 08:16:39
49993,2016-03-15 18:47:35,Audi_A3__1_8l__Silber;_schoenes_Fahrzeug,privat,Angebot,1650,control,kleinwagen,1997,manuell,0,,150000,7,benzin,audi,,2016-03-15 00:00:00,0,65203,2016-04-06 19:46:53
49991,2016-03-06 15:25:19,Kleinwagen,privat,Angebot,500,control,,2016,manuell,0,twingo,150000,0,benzin,renault,,2016-03-06 00:00:00,0,61350,2016-03-06 18:24:19


Given the price and brand of these cars, the values in the **odometer_km** column are coherent and are not outliers. A quick look at this [website](https://www.autoscout24.com/lst/opel/vectra) supports this hypothesis.

## Price investigation <a class="anchor" id="s4"></a>
---

In [13]:
# how many unique values?
autos["price"].unique().shape

(2357,)

In [14]:
# Statistics
autos["price"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [15]:
# Value count in descending order
autos["price"].value_counts().sort_index(ascending=False).head(15)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price, dtype: int64

A quick look at the price value counts suggests <span style='background :yellow'> **OUTLIERS** with price > 350,000 $ </span>.

Let's confirm by checking whether the most expensive used cars in the dataset make sense.

In [16]:
autos[autos["price"]>=350000].sort_values(by='price', ascending=True).head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
36818,2016-03-27 18:37:37,Porsche_991,privat,Angebot,350000,control,coupe,2016,manuell,500,911,5000,3,benzin,porsche,nein,2016-03-27 00:00:00,0,70499,2016-03-27 18:37:37
37585,2016-03-29 11:38:54,Volkswagen_Jetta_GT,privat,Angebot,999990,test,limousine,1985,manuell,111,jetta,150000,12,benzin,volkswagen,ja,2016-03-29 00:00:00,0,50997,2016-03-29 11:38:54
514,2016-03-17 09:53:08,Ford_Focus_Turnier_1.6_16V_Style,privat,Angebot,999999,test,kombi,2009,manuell,101,focus,125000,4,benzin,ford,nein,2016-03-17 00:00:00,0,12205,2016-04-06 07:17:35
43049,2016-03-21 19:53:52,2_VW_Busse_T3,privat,Angebot,999999,test,bus,1981,manuell,70,transporter,150000,1,benzin,volkswagen,,2016-03-21 00:00:00,0,99880,2016-03-28 17:18:28
22947,2016-03-22 12:54:19,Bmw_530d_zum_ausschlachten,privat,Angebot,1234566,control,kombi,1999,automatik,190,,150000,2,diesel,bmw,,2016-03-22 00:00:00,0,17454,2016-04-02 03:17:32


A quick look at this [website](https://www.reezocar.com/en/) shows a **Porsche 991** priced around 345,000 \\$ at 5,000 km. But a **Volkswagen Jetta GT** from 1989 with 280,000 km sells for approximately 2,230 \\$. This confirms that prices above 350,000 \\$ are outliers.

In [17]:
# Value count in ascending order
autos["price"].value_counts().sort_index(ascending=True).head()

0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64

In [18]:
autos[autos["price"] == 0].head(10)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
27,2016-03-27 18:45:01,Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE,privat,Angebot,0,control,,2005,,0,,150000,0,,ford,,2016-03-27 00:00:00,0,66701,2016-03-27 18:45:01
71,2016-03-28 19:39:35,Suche_Opel_Astra_F__Corsa_oder_Kadett_E_mit_Re...,privat,Angebot,0,control,,1990,manuell,0,,5000,0,benzin,opel,,2016-03-28 00:00:00,0,4552,2016-04-07 01:45:48
80,2016-03-09 15:57:57,Nissan_Primera_Hatchback_1_6_16v_73_Kw___99Ps_...,privat,Angebot,0,control,coupe,1999,manuell,99,primera,150000,3,benzin,nissan,ja,2016-03-09 00:00:00,0,66903,2016-03-09 16:43:50
87,2016-03-29 23:37:22,Bmw_520_e39_zum_ausschlachten,privat,Angebot,0,control,,2000,,0,5er,150000,0,,bmw,,2016-03-29 00:00:00,0,82256,2016-04-06 21:18:15
99,2016-04-05 09:48:54,Peugeot_207_CC___Cabrio_Bj_2011,privat,Angebot,0,control,cabrio,2011,manuell,0,2_reihe,60000,7,diesel,peugeot,nein,2016-04-05 00:00:00,0,99735,2016-04-07 12:17:34
118,2016-03-12 05:03:00,VW_Sharan_V6_204_PS_Karosse_Rohkarosse_mit_Pap...,privat,Angebot,0,control,bus,2001,manuell,204,sharan,150000,7,benzin,volkswagen,ja,2016-03-12 00:00:00,0,15370,2016-03-12 21:44:23
146,2016-03-22 23:59:28,Ford_Fiesta_rot,privat,Angebot,0,test,kleinwagen,1996,manuell,75,fiesta,20000,8,benzin,ford,,2016-03-22 00:00:00,0,63069,2016-04-01 20:16:38
167,2016-04-02 19:43:45,Suche_VW_Multivan_Innenausstattung_Set_oder_TE...,privat,Angebot,0,control,,2011,,0,transporter,5000,0,,volkswagen,,2016-04-02 00:00:00,0,64739,2016-04-06 19:45:08
180,2016-03-19 10:50:25,Zu_verkaufen,privat,Angebot,0,test,,2016,manuell,98,3_reihe,150000,12,benzin,mazda,ja,2016-03-19 00:00:00,0,30966,2016-03-24 03:17:21
226,2016-03-25 23:52:12,Porsche_911_S_Targa__67er_SWB,privat,Angebot,0,control,cabrio,1967,manuell,160,911,5000,12,benzin,porsche,nein,2016-03-25 00:00:00,0,44575,2016-04-05 14:46:39


A price of 0\\$ usually indicates a price on demand. A quick look at this [website](https://www.reezocar.com/en/) shows a **Porsche Targa 911 S** with a price on demand. If we check used **Porsche 911** cars from around 1970 with 5,000 km, prices are above 80,000 \\$. So we might want to exclude the price-on-demand listings, that is, the rows with a price == 0\\$.

In [19]:
# .copy() makes clean_autos an independent DataFrame that can be modified later
clean_autos = autos[(autos["price"] >= 1) & (autos["price"] <= 350000)].copy()
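
The same kind of range filter can also be written with `Series.between`, which is inclusive on both ends by default. A minimal sketch on made-up prices (the `prices` Series is hypothetical):

```python
import pandas as pd

# Hypothetical prices: a placeholder 0, plausible values, and an inflated one.
prices = pd.Series([0, 1, 500, 16500, 350000, 999999])

# Keep prices from 1 to 350000, both bounds included.
mask = prices.between(1, 350000)
kept = prices[mask]
print(kept.tolist())
```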

## Registration year investigation <a class="anchor" id="s4"></a>
---

The <span style='background:pink'>date_crawled</span>, <span style='background:pink'>last_seen</span> and <span style='background:pink'>ad_created</span> columns are all identified as string values. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values.
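
As an alternative to working on the raw strings, timestamps in this layout can be parsed into a datetime dtype. The sketch below runs on made-up values and assumes the `YYYY-MM-DD HH:MM:SS` format seen in the data; the `stamps` Series is hypothetical.

```python
import pandas as pd

# Made-up timestamps in the same layout as the three date columns.
stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-03-26 18:57:24",
    "2016-04-04 13:38:56",
])

parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# Share of rows per calendar day, in chronological order.
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
```

Slicing the first ten characters of each string, as done in the cells that follow, yields the same daily grouping without changing the column dtype.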

In [20]:
clean_autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [21]:
# dropna=False includes missing values in the counts, and normalize=True returns proportions instead of counts
print("First added by the crawler: \n")
df1=clean_autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
print(df1)

First added by the crawler: 

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64


In [22]:
print("\nWhen the crawler saw this ad last online: \n")
df2=clean_autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
print(df2)


When the crawler saw this ad last online: 

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64


In [23]:
print("\nThe date on which the eBay listing was created.\n")
df3=clean_autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
print(df3)


The date on which the eBay listing was created.

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64


Since the ads were crawled and created no later than April 2016, any car with a `registration_year` above 2016 must be inaccurate. Indeed, a car cannot be first registered after its eBay listing was seen.
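
This consistency rule can be checked row by row by comparing the registration year with the year in which the ad was created. A minimal sketch on made-up rows (the `listings` frame is hypothetical):

```python
import pandas as pd

# Made-up listings: two plausible registration years and two impossible ones.
listings = pd.DataFrame({
    "registration_year": [2004, 2016, 2017, 9999],
    "ad_created": ["2016-03-26 00:00:00"] * 4,
})

created_year = pd.to_datetime(
    listings["ad_created"], format="%Y-%m-%d %H:%M:%S"
).dt.year

# Rows where the car would have been registered after the ad was created.
suspicious = listings[listings["registration_year"] > created_year]
print(suspicious["registration_year"].tolist())
```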

In [24]:
clean_autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The minimum and maximum values are implausible:
- `min` = 1000, which predates 1886, [regarded as the birth year of the modern car](https://en.wikipedia.org/wiki/Car#:~:text=The%20year%201886%20is%20regarded,by%20the%20Ford%20Motor%20Company.)
- `max` = 9999, which is far beyond the current year

In [25]:
clean_autos['registration_year'].value_counts().sort_index().head(10)

1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
Name: registration_year, dtype: int64

A quick internet search indicates that cars were invented in the late 1800s. Let's see what kinds of cars are listed for the years `1910` and `1927`.

In [26]:
clean_autos[(clean_autos['registration_year']==1910) | (clean_autos['registration_year']==1927)].sort_values(by='registration_year', ascending=True)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
3679,2016-04-04 00:36:17,Suche_Auto,privat,Angebot,1,test,,1910,,0,,5000,0,,sonstige_autos,,2016-04-04 00:00:00,0,40239,2016-04-04 07:49:15
22659,2016-03-14 08:51:18,Opel_Corsa_B,privat,Angebot,500,test,,1910,,0,corsa,150000,0,,opel,,2016-03-14 00:00:00,0,52393,2016-04-03 07:53:55
28693,2016-03-22 17:48:41,Renault_Twingo,privat,Angebot,599,control,kleinwagen,1910,manuell,0,,5000,0,benzin,renault,,2016-03-22 00:00:00,0,70376,2016-04-06 09:16:59
30781,2016-03-25 13:47:46,Opel_Calibra_V6_DTM_Bausatz_1:24,privat,Angebot,30,test,,1910,,0,calibra,100000,0,,opel,,2016-03-25 00:00:00,0,47638,2016-03-26 23:46:29
45157,2016-03-11 22:37:01,Motorhaube,privat,Angebot,15,control,,1910,,0,,5000,0,,trabant,,2016-03-11 00:00:00,0,90491,2016-03-25 11:18:57
21416,2016-03-12 08:36:21,Essex_super_six__Ford_A,privat,Angebot,16500,control,cabrio,1927,manuell,40,andere,5000,5,benzin,ford,,2016-03-12 00:00:00,0,74821,2016-03-15 12:45:12


A quick internet search indicates that the first-generation Renault Twingo dates from 1993, so its `1910` registration year is wrong. But the [Essex super six Ford](https://www.conceptcarz.com/vehicle/z11897/essex-super-six.aspx) is a car from the year `1927`, which we take as the lowest acceptable registration year in the dataset.

In [27]:
clean_autos['registration_year'].value_counts().sort_index().tail(10)

2800    1
4100    1
4500    1
4800    1
5000    4
5911    1
6200    1
8888    1
9000    1
9999    3
Name: registration_year, dtype: int64

Let's remove the values outside the 1927 - 2016 interval.

In [28]:
clean_autos = clean_autos[(clean_autos['registration_year'] >= 1927) & (clean_autos['registration_year'] <= 2016)].copy()
clean_autos['registration_year'].describe()

count    46676.000000
mean      2002.920709
std          7.120843
min       1927.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

In [29]:
clean_autos['registration_year'].value_counts(normalize=True).head(10).sort_index()

1998    0.050626
1999    0.062066
2000    0.067615
2001    0.056474
2002    0.053261
2003    0.057824
2004    0.057910
2005    0.062902
2006    0.057203
2007    0.048783
Name: registration_year, dtype: float64

Cars registered between `1998` and `2007` are the most listed on eBay, with `2000` being the registration year with the most listings.

## Exploring Price by Brand <a class="anchor" id="s5"></a>

In [30]:
clean_autos['brand'].value_counts(normalize=True, sort=True)

volkswagen        0.211286
bmw               0.110057
opel              0.107550
mercedes_benz     0.096474
audi              0.086576
ford              0.069907
renault           0.047133
peugeot           0.029844
fiat              0.025645
seat              0.018275
skoda             0.016411
nissan            0.015276
mazda             0.015190
smart             0.014161
citroen           0.014011
toyota            0.012705
hyundai           0.010027
sonstige_autos    0.009791
volvo             0.009148
mini              0.008763
mitsubishi        0.008227
honda             0.007841
kia               0.007070
alfa_romeo        0.006642
porsche           0.006127
suzuki            0.005935
chevrolet         0.005699
chrysler          0.003514
dacia             0.002635
daihatsu          0.002507
jeep              0.002271
subaru            0.002142
land_rover        0.002100
saab              0.001650
jaguar            0.001564
daewoo            0.001500
trabant           0.001371
r

Let's explore variations across different car brands in terms of __price__. In order to do so, we will aggregate over the top 20 brands using `Series.index` attribute to access the labels.
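
The idea of selecting the most frequent labels through `.index` and then aggregating per brand can be sketched on made-up data. Note that this version uses `groupby` instead of an explicit loop, which is a swapped-in technique and not the approach of the cells below; the `cars` frame is hypothetical.

```python
import pandas as pd

# Made-up listings for three brands.
cars = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "bmw", "opel"],
    "price": [9000, 11000, 8000, 9000, 7000, 3000],
})

# value_counts() sorts by frequency, so the first labels are the top brands.
top_brands = cars["brand"].value_counts().index[:2]

# Mean price per brand, restricted to the selected brands.
mean_price = (
    cars[cars["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```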

In [31]:
selected_brands= clean_autos['brand'].value_counts(normalize=True, sort=True).index[:20]
selected_brands

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart',
       'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'],
      dtype='object')

In [32]:
brand_mean_price = {}

for b in selected_brands:
    sel_brand = clean_autos[clean_autos['brand']== b]
    brand_mean_price[b] = sel_brand['price'].mean().round()
    
brand_mean_price_sorted=sorted(brand_mean_price.items(), key=lambda item: item[1], reverse = True)

In [33]:
# DataFrame.from_records() builds a DataFrame from a sequence of tuples (or a structured ndarray)
pd.DataFrame.from_records(brand_mean_price_sorted, columns=["Brand", "Price"])

Unnamed: 0,Brand,Price
0,sonstige_autos,12366.0
1,mini,10613.0
2,audi,9337.0
3,mercedes_benz,8628.0
4,bmw,8333.0
5,skoda,6368.0
6,volkswagen,5402.0
7,hyundai,5365.0
8,toyota,5167.0
9,volvo,4947.0


As we can see, __sonstige_autos__ (German for "other cars") and __mini__ cars are the most expensive cars listed on eBay, with an average price above 10k\\$. __Audi__ cars are not far behind, with an average price above 9k\\$. The least expensive listed cars are __opel__, __fiat__ and __renault__ cars, with an average price under 3k\\$. Let's use aggregation to understand the average mileage for those cars and see if there's any visible link with mean price.

## Correlation between price and mileage (odometer_km)? <a class="anchor" id="s6"></a>

In [34]:
brand_mean_odometer = {}

for b in selected_brands:
    sel_brand = clean_autos[clean_autos['brand']== b]
    brand_mean_odometer[b] = sel_brand['odometer_km'].mean().round()

# Convert both dictionaries to series objects, using the series constructor
bmp_series= pd.Series(brand_mean_price)
bmo_series= pd.Series(brand_mean_odometer)

#Create a dataframe from the first series object using the dataframe constructor
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df['mean_odometer_km'] = bmo_series
df.sort_values('mean_price', ascending=False)

Unnamed: 0,mean_price,mean_odometer_km
sonstige_autos,12366.0,90142.0
mini,10613.0,88105.0
audi,9337.0,129157.0
mercedes_benz,8628.0,130788.0
bmw,8333.0,132573.0
skoda,6368.0,110849.0
volkswagen,5402.0,128707.0
hyundai,5365.0,106442.0
toyota,5167.0,115944.0
volvo,4947.0,138068.0


As we can see, __sonstige_autos__ and __mini__ cars stand out somewhat, with a lower average mileage than the rest. Still, it is difficult to judge the impact of mileage across car brands from these averages alone. Let's create a subset of the dataframe for __upper middle class cars__. Mileage will be grouped into 3 bins to get a broader sense of its influence on the average price.
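
The binning step can be sketched with `pd.cut`, which labels each row with its interval so that the mean price per bin follows from a single `groupby`. This is an alternative to filtering each interval by hand, shown on made-up data; the `cars` frame is hypothetical.

```python
import pandas as pd

# Made-up listings covering the observed mileage range.
cars = pd.DataFrame({
    "odometer_km": [5000, 30000, 70000, 90000, 125000, 150000],
    "price": [22000, 20000, 16000, 15000, 7000, 6000],
})

# Three equal-width intervals across the mileage range.
cars["mileage_bin"] = pd.cut(cars["odometer_km"], bins=3)

# Mean price within each interval, ordered from low to high mileage.
mean_by_bin = cars.groupby("mileage_bin", observed=True)["price"].mean()
print(mean_by_bin)
```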

In [35]:
upper_middle_class = clean_autos[(clean_autos['brand']== 'sonstige_autos')|(clean_autos['brand']== 'mini')|(clean_autos['brand']== 'audi')|(clean_autos['brand']== 'mercedes_benz')|(clean_autos['brand']== 'bmw')]
print("The 3 intervals of mileage for upper middle class cars:")
upper_middle_class['odometer_km'].value_counts(bins=3)

The 3 intervals of mileage for upper middle class cars:


(101666.667, 150000.0]     11340
(53333.333, 101666.667]     1959
(4854.999, 53333.333]       1248
Name: odometer_km, dtype: int64

In [36]:
# Mileage group 1 (101666.667, 150000.0]
price1=upper_middle_class.loc[upper_middle_class['odometer_km'] >= 101666.667, 'price'].mean().round()
print("Average price for cars with mileage range from 101666.667 to 150000.0 kms: ", price1, "$")

# Mileage group2 (53333.333, 101666.667]
price2=upper_middle_class.loc[(upper_middle_class['odometer_km'] > 53333.333) & (upper_middle_class['odometer_km'] <101666.667), 'price'].mean().round()
print("Average price for cars with mileage range from 53333.333 to 101666.667 kms: ", price2, "$")

# Mileage group 3 (4854.999, 53333.333]
price3=upper_middle_class.loc[upper_middle_class['odometer_km'] <= 53333.333, 'price'].mean().round()
print("Average price for cars with mileage range from 4854.999 to 53333.333 kms: ", price3, "$")

Average price for cars with mileage range from 101666.667 to 150000.0 kms:  6272.0 $
Average price for cars with mileage range from 53333.333 to 101666.667 kms:  15882.0 $
Average price for cars with mileage range from 4854.999 to 53333.333 kms:  21746.0 $


### Here are some conclusions we can draw from these results:
- In general, brand reputation affects the listed price more than the recorded mileage does
- For upper middle class cars, the price tends to be lower with higher mileage
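
To put a number on the second point, one option is the Pearson correlation coefficient between mileage and price. This is an added illustration, not something computed in the notebook, and it runs on made-up values; the `cars` frame is hypothetical.

```python
import pandas as pd

# Made-up listings where price falls as mileage rises.
cars = pd.DataFrame({
    "odometer_km": [5000, 50000, 100000, 150000],
    "price": [20000, 15000, 9000, 4000],
})

# Series.corr computes the Pearson coefficient by default.
r = cars["odometer_km"].corr(cars["price"])
print(round(r, 3))
```

A value close to -1 indicates a strong negative linear relationship. On the real data the coefficient would likely be weaker, since most listings share the same 150,000 km mileage value.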

## Further data cleaning <a class="anchor" id="s7"></a>

As further data cleaning, we could identify categorical columns that use German words and map the values to their English counterparts.

As we saw in the [Dataset and aim of the study](#s1), the following columns have the dtypes `object`:

In [37]:
categorical_data = clean_autos.select_dtypes(include='object')
categorical_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46676 entries, 0 to 49999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   date_crawled       46676 non-null  object
 1   name               46676 non-null  object
 2   seller             46676 non-null  object
 3   offer_type         46676 non-null  object
 4   abtest             46676 non-null  object
 5   vehicle_type       43976 non-null  object
 6   gearbox            44570 non-null  object
 7   model              44486 non-null  object
 8   fuel_type          43362 non-null  object
 9   brand              46676 non-null  object
 10  unrepaired_damage  38374 non-null  object
 11  ad_created         46676 non-null  object
 12  last_seen          46676 non-null  object
dtypes: object(13)
memory usage: 5.0+ MB


Let's set aside the columns related to dates or to properties of the car, such as __model__ and __brand__. For each remaining column, we use the pandas method `.unique` to check which **unique** values it contains and whether they are in German, translate them into English with Google Translate, and finally use the pandas method `.replace` to map the values to their English counterparts.
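
The whole mapping can also be kept in a single dictionary and applied in a loop. The sketch below is a hedged illustration on made-up data: the `cars` frame and the `translations` dictionary are hypothetical, and the English terms are illustrative choices rather than the notebook's.

```python
import pandas as pd

# Made-up rows, including a missing gearbox value.
cars = pd.DataFrame({
    "gearbox": ["manuell", "automatik", None],
    "fuel_type": ["benzin", "diesel", "elektro"],
})

# One dictionary per column: German value -> English value.
translations = {
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "fuel_type": {"benzin": "petrol", "elektro": "electric"},
}

for column, mapping in translations.items():
    # Values without a mapping, and missing values, are left untouched.
    cars[column] = cars[column].replace(mapping)

print(cars)
```

Assigning the result back to the column avoids relying on `inplace=True` applied to a column selection.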

In [38]:
print("Unique value for 'seller' column - before replace : \n", categorical_data['seller'].unique())
clean_autos['seller'] = clean_autos['seller'].replace({"gewerblich": "commercial"})
print("\nUnique value for 'seller' column - after replace : \n", clean_autos['seller'].unique())

Unique value for 'seller' column - before replace : 
 ['privat' 'gewerblich']

Unique value for 'seller' column - after replace : 
 ['privat' 'commercial']


In [39]:
print("Unique value for 'offer_type' column - before replace : \n", categorical_data['offer_type'].unique())
clean_autos['offer_type'] = clean_autos['offer_type'].replace({"Angebot": "offer"})
print("\nUnique value for 'offer_type' column - after replace :\n", clean_autos['offer_type'].unique())

Unique value for 'offer_type' column - before replace : 
 ['Angebot']

Unique value for 'offer_type' column - after replace :
 ['offer']


In [40]:
# no German words to translate
categorical_data['abtest'].unique()

array(['control', 'test'], dtype=object)

In [41]:
# to return unique values without the NaN, simply chain the dropna and unique functions together
print("Unique value for 'vehicle_type' column - before replace : \n", categorical_data['vehicle_type'].dropna().unique())
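# assign back instead of a chained inplace call, which recent pandas versions warn about
clean_autos['vehicle_type'] = clean_autos['vehicle_type'].replace({"kleinwagen":"small car", "kombi":"combi", "andere":"other"})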
print("\nUnique value for 'vehicle_type' column - after replace : \n", clean_autos['vehicle_type'].dropna().unique())

Unique value for 'vehicle_type' column - before replace : 
 ['bus' 'limousine' 'kleinwagen' 'kombi' 'coupe' 'suv' 'cabrio' 'andere']

Unique value for 'vehicle_type' column - after replace : 
 ['bus' 'limousine' 'small car' 'combi' 'coupe' 'suv' 'cabrio' 'other']


In [42]:
print("Unique value for 'gearbox' column - before replace : \n", categorical_data['gearbox'].dropna().unique())
# 'manuell' translates to the adjective 'manual'; assign back instead of a chained inplace call
clean_autos['gearbox'] = clean_autos['gearbox'].replace({"manuell":"manual", "automatik":"automatic"})
print("\nUnique value for 'gearbox' column - after replace : \n", clean_autos['gearbox'].dropna().unique())

Unique value for 'gearbox' column - before replace : 
 ['manuell' 'automatik']

Unique value for 'gearbox' column - after replace : 
 ['manual' 'automatic']


In [43]:
print("Unique value for 'fuel_type' column - before replace : \n", categorical_data['fuel_type'].dropna().unique())
# 'elektro' translates to 'electric'; assign back instead of a chained inplace call
clean_autos['fuel_type'] = clean_autos['fuel_type'].replace({"benzin":"petrol", "elektro":"electric", "andere":"other"})
print("\nUnique value for 'fuel_type' column - after replace : \n", clean_autos['fuel_type'].dropna().unique())

Unique value for 'fuel_type' column - before replace : 
 ['lpg' 'benzin' 'diesel' 'cng' 'hybrid' 'elektro' 'andere']

Unique value for 'fuel_type' column - after replace : 
 ['lpg' 'petrol' 'diesel' 'cng' 'hybrid' 'electric' 'other']


In [44]:
print("Unique value for 'unrepaired_damage' column - before replace : \n", categorical_data['unrepaired_damage'].dropna().unique())
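# assign back instead of a chained inplace call, which recent pandas versions warn about
clean_autos['unrepaired_damage'] = clean_autos['unrepaired_damage'].replace({"nein":"No", "ja":"Yes"})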
print("\nUnique value for 'unrepaired_damage' column - after replace : \n", clean_autos['unrepaired_damage'].dropna().unique())

Unique value for 'unrepaired_damage' column - before replace : 
 ['nein' 'ja']

Unique value for 'unrepaired_damage' column - after replace : 
 ['No' 'Yes']
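
The per-column cells above all follow the same pattern, so they could be consolidated into a single call. Here is a minimal sketch, assuming a toy dataframe `toy_autos` (made-up rows standing in for `clean_autos`) and a hypothetical `translations` dictionary. The English labels are illustrative choices, not the notebook's exact ones.

```python
import pandas as pd

# Toy stand-in for clean_autos; the rows are made up for illustration only
toy_autos = pd.DataFrame({
    "seller": ["privat", "gewerblich", "privat"],
    "gearbox": ["manuell", "automatik", None],
    "unrepaired_damage": ["nein", "ja", None],
})

# Hypothetical mapping: one inner dictionary per column, German value -> English value
translations = {
    "seller": {"privat": "private", "gewerblich": "commercial"},
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "unrepaired_damage": {"nein": "No", "ja": "Yes"},
}

# DataFrame.replace accepts a nested dict of the form {column: {old: new}}
# and returns a new dataframe; missing values are left untouched
translated = toy_autos.replace(translations)
print(translated)
```

Keeping every mapping in one dictionary makes it easier to review the translations in one place and to spot a value that was missed.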


## Most common brand/model combinations <a class="anchor" id="s8"></a>

In [45]:
# .size() is a GroupBy method that returns the number of rows in each (brand, model) group
print("Top 5 Most common brand/model combinations :\n \n", clean_autos.groupby(['brand', 'model']).size().sort_values(ascending=False).head())

Top 5 Most common brand/model combinations :
 
 brand       model 
volkswagen  golf      3707
bmw         3er       2615
volkswagen  polo      1609
opel        corsa     1591
volkswagen  passat    1349
dtype: int64
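
As a cross-check, the same counts can probably be obtained with `DataFrame.value_counts`, which sorts in descending order by default. The sketch below uses a made-up dataframe `toy` rather than `clean_autos`. Both approaches skip rows whose brand or model is missing.

```python
import pandas as pd

# Made-up listings for illustration; one row has no model
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "bmw", "volkswagen", "bmw", "opel"],
    "model": ["golf", "golf", "3er", "polo", "3er", None],
})

# Approach used in the notebook: group, count rows per group, sort
by_size = toy.groupby(["brand", "model"]).size().sort_values(ascending=False)

# Alternative: count unique (brand, model) rows directly
by_counts = toy.value_counts(subset=["brand", "model"])

print(by_counts.head())
```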


## How much cheaper are cars with damage than their non-damaged counterparts? <a class="anchor" id="s9"></a>

Let's compare the prices of cars with unrepaired damage with those of their undamaged counterparts. To do so, we restrict the data to five brands (audi, bmw, mercedes_benz, mini and sonstige_autos, i.e. "other cars") and compute the mean price for each brand and damage status.
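
If the brands were to be picked by frequency rather than listed by hand, the labels of a `value_counts()` result can be read from its `.index` attribute and passed to `.isin()`. This is only a sketch on a made-up dataframe `toy`; `top_n` is a hypothetical parameter, set to 2 to keep the example small.

```python
import pandas as pd

# Made-up listings for illustration only
toy = pd.DataFrame({
    "brand": ["bmw", "audi", "bmw", "opel", "audi", "bmw", "fiat"],
    "price": [9000, 8000, 9500, 3000, 8500, 10000, 2000],
})

top_n = 2  # hypothetical; use 5 to select five brands

# value_counts() sorts by frequency, so the first labels of its index are the most common brands
top_brands = toy["brand"].value_counts().head(top_n).index

# Keep only the rows whose brand is one of the selected labels
subset = toy[toy["brand"].isin(top_brands)]
print(subset)
```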

In [46]:
# Select upper middle class cars
brands = ['sonstige_autos', 'mini', 'audi', 'mercedes_benz', 'bmw']
clean_upper_middle_class = clean_autos.loc[clean_autos['brand'].isin(brands)]

# group by 'unrepaired_damage' & 'brand' and compute the rounded mean 'price'
df_damage = clean_upper_middle_class.groupby(['unrepaired_damage', 'brand'], as_index=False).price.mean().round()

# 'No' = no unrepaired damage (undamaged cars), 'Yes' = unrepaired damage (damaged cars)
undamaged = df_damage[df_damage['unrepaired_damage']=='No']
damaged = df_damage[df_damage['unrepaired_damage']=='Yes']
display(undamaged)
display(damaged)

| | unrepaired_damage | brand | price |
| :--- | :--- | :--- | :--- |
| 0 | No | audi | 10915.0 |
| 1 | No | bmw | 9438.0 |
| 2 | No | mercedes_benz | 9798.0 |
| 3 | No | mini | 11158.0 |
| 4 | No | sonstige_autos | 15706.0 |


| | unrepaired_damage | brand | price |
| :--- | :--- | :--- | :--- |
| 5 | Yes | audi | 3325.0 |
| 6 | Yes | bmw | 3513.0 |
| 7 | Yes | mercedes_benz | 3922.0 |
| 8 | Yes | mini | 4595.0 |
| 9 | Yes | sonstige_autos | 6375.0 |


In [47]:
# Reset both indexes to 0..4 so the prices can be compared row by row
undamaged = undamaged.reset_index(drop=True)
damaged = damaged.reset_index(drop=True)

# difference between the mean prices (undamaged minus damaged)
price_damage_comparison = undamaged['price'] - damaged['price']

# the same difference as a percentage of the damaged mean price
price_damage_percent_diff = (price_damage_comparison / damaged['price'] * 100).round()

# Build the summary dataframe, starting from the brand labels
df_damage = undamaged[['brand']].copy()
df_damage['price_mean_diff'] = price_damage_comparison
df_damage['price_diff (%)'] = price_damage_percent_diff
df_damage

| | brand | price_mean_diff | price_diff (%) |
| :--- | :--- | :--- | :--- |
| 0 | audi | 7590.0 | 228.0 |
| 1 | bmw | 5925.0 | 169.0 |
| 2 | mercedes_benz | 5876.0 | 150.0 |
| 3 | mini | 6563.0 | 143.0 |
| 4 | sonstige_autos | 9331.0 | 146.0 |


Unsurprisingly, cars without unrepaired damage are more expensive than their damaged counterparts: for each of the five brands, the gap between the mean prices lies between roughly 5,900 and 9,300.
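
To answer the question in the heading more directly, the gap can also be expressed relative to the undamaged price, i.e. how much cheaper a damaged car is. The sketch below is one possible way to do it, not the notebook's own method. It reuses the mean prices of three brands from the two tables above; the names `mean_prices`, `wide` and `cheaper_pct` are hypothetical.

```python
import pandas as pd

# Mean prices copied from the two tables above (three brands only)
mean_prices = pd.DataFrame({
    "unrepaired_damage": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "brand": ["audi", "bmw", "mini", "audi", "bmw", "mini"],
    "price": [10915.0, 9438.0, 11158.0, 3325.0, 3513.0, 4595.0],
})

# One row per brand, one column per damage status ('No' / 'Yes')
wide = mean_prices.pivot(index="brand", columns="unrepaired_damage", values="price")

# Absolute gap, then the gap as a percentage of the undamaged mean price
wide["diff"] = wide["No"] - wide["Yes"]
wide["cheaper_pct"] = (wide["diff"] / wide["No"] * 100).round(1)
print(wide)
```

Pivoting puts both prices of a brand on the same row, so the subtraction no longer depends on the two tables being sorted identically.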

[![alt text](BackToTop.png "Back to the top")](#s0)