# Exploring eBay Car Sales Data

In this guided project, we'll work with a dataset of used cars from *eBay Kleinanzeigen*, a classifieds section of the German eBay website.

The aim of this project is to clean the data and analyze the included used car listings. 

***
### Loading of the dataset
First, let's import the **pandas** and **numpy** libraries and load the dataset into the variable `autos`.

In [3]:
import pandas as pd
import numpy as np

autos = pd.read_csv("additional_files/autos.csv",encoding="Latin-1")

The first five rows of the `autos` dataset can be seen below:

In [10]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


General information about the dataset:

In [9]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

The general information about our dataset reveals that some columns have **null values**: `vehicleType`, `gearbox`, `model`, `fuelType` and `notRepairedDamage`.<br>
In total, the dataset has 20 columns and 371528 rows.<br>
Of those 20 columns, 13 contain strings and 7 contain integer values.
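
As a complement to reading the `info()` output, the null counts can also be computed directly. The following is a minimal sketch on a hypothetical toy frame (the name `toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for `autos`; values are illustrative only.
toy = pd.DataFrame({
    "vehicleType": ["coupe", None, "suv"],
    "gearbox": ["manuell", "manuell", None],
    "price": [480, 18300, 9800],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```

Applied to `autos`, the same two lines should list exactly the columns named above.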


***
### Conversion of column names

Next, we convert the `camelCase` column names to `snake_case`:

In [11]:
# print existing column names
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [26]:
columns = list(autos.columns)

# change of some column names to be more descriptive
columns[7] = "registration_year"  # yearOfRegistration -> registration_year
columns[12] = "registration_month" # monthOfRegistration -> registration_month
columns[15] = "unrepaired_damage" # notRepairedDamage -> unrepaired_damage
columns[16] = "ad_created" # dateCreated -> ad_created

# convert all other columns to snake_case
for i in range(len(columns)):
    curr_col = columns[i]
    new_col = ""
    for char in curr_col:
        if char.isupper():
            char = "_"+char.lower()
        new_col = new_col + char
    columns[i] = new_col.strip()
# the loop turns "powerPS" into "power_p_s"; collapse it to "power_ps"
columns[9] = columns[9].replace("p_s","ps")

autos.columns = columns

After replacing the column names the dataset looks like this:

In [28]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
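
The character loop used for the renaming could also be written with a regular expression. This is only a sketch of an alternative, not the method used in this notebook; the helper name `camel_to_snake` is an assumption introduced for illustration:

```python
import re

def camel_to_snake(name):
    """Put "_" before each uppercase letter that follows a lowercase
    letter or digit, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

for original in ["dateCrawled", "nrOfPictures", "powerPS", "abtest"]:
    print(original, "->", camel_to_snake(original))
```

Because the underscore is only inserted after a lowercase letter or digit, `powerPS` becomes `power_ps` directly and needs no extra `replace` step.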


***
### First basic data exploration and cleaning

For data cleaning, we first have a look at the description of all columns:

In [29]:
autos.describe(include="all")

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-06 13:45:54
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


In [35]:
autos["unrepaired_damage"].value_counts()

nein    263182
ja       36286
Name: unrepaired_damage, dtype: int64

Taking into account the information given above, one can conclude the following:

1. The following columns can be dropped because they contain almost only one value:<br>`seller` and `offer_type`

2. The following columns need more investigation:<br>`date_crawled`, `registration_year`, `power_ps`, `ad_created`, `nr_of_pictures`, `postal_code` and `last_seen`.

3. The `kilometer` column should be renamed to `odometer_km`.

Solution for point 3:

In [39]:
autos.rename({"kilometer":"odometer_km"},axis=1,inplace=True)
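
Point 1 is not carried out in this notebook. If one wanted to drop the two near-constant columns, a sketch could look like the following; the toy frame and the name `trimmed` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy frame with the two near-constant columns.
toy = pd.DataFrame({
    "seller": ["privat", "privat"],
    "offer_type": ["Angebot", "Angebot"],
    "price": [480, 18300],
})

# drop() returns a new frame; the original is left untouched.
trimmed = toy.drop(columns=["seller", "offer_type"])
print(list(trimmed.columns))
```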

***
### Closer look at `price` and `odometer_km`

Question 1: How many unique values are contained?

In [50]:
print("Unique values in price:")
print(autos["price"].unique().shape[0])
print("\n")
print("Unique values in odometer_km:")
print(autos["odometer_km"].unique().shape[0])

Unique values in price:
5597


Unique values in odometer_km:
13
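
A shorter way to get these counts is `Series.nunique()`. One difference is worth knowing: `unique()` counts `NaN` as a value, while `nunique()` ignores it by default. A small sketch on made-up values:

```python
import pandas as pd
import numpy as np

# Illustrative values only; the NaN shows where the two approaches differ.
s = pd.Series([5000, 150000, 150000, np.nan])

n_with_unique = s.unique().shape[0]      # NaN counted as a distinct value
n_default = s.nunique()                  # NaN ignored by default
n_with_nan = s.nunique(dropna=False)     # NaN included

print(n_with_unique, n_default, n_with_nan)
```

For `price` and `odometer_km`, which have no null values, both approaches give the same result.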


Question 2: What are statistical values like min, max, median, mean, etc.?

In [53]:
autos["price"].describe()

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64

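Here, we can see that the `minimum` and `maximum` prices are out of line.<br>
The minimum price of `0` suggests that some prices are missing.<br>
The maximum price of `2.15E9` looks like an outlier because it is far too large; the exact value, 2147483647, is the largest 32-bit signed integer, which hints at a placeholder rather than a real price.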

In [64]:
# Show the thirty highest prices and how often each occurs:
autos["price"].value_counts().sort_index(ascending=False).head(30)

2147483647     1
99999999      15
99000000       1
74185296       1
32545461       1
27322222       1
14000500       1
12345678       9
11111111      10
10010011       1
10000000       8
9999999        3
3895000        1
3890000        1
2995000        1
2795000        1
1600000        2
1300000        1
1250000        2
1234566        1
1111111        2
1010010        1
1000000        5
999999        13
999990         1
911911         1
849000         1
820000         1
780000         1
745000         2
Name: price, dtype: int64

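From this list of highest prices, we decide to omit rows with prices above `100,000`. We would also like to keep only rows with prices above `0`; note, however, that `between(0,100000)` is inclusive at both ends, so listings priced at `0` remain in the filtered data: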

In [71]:
autos_filtered = autos[autos["price"].between(0,100000)]
print(autos_filtered["price"].unique().shape[0])
autos_filtered["price"].describe()

5385


count    371125.000000
mean       5607.804734
std        7503.923817
min           0.000000
25%        1150.000000
50%        2950.000000
75%        7199.000000
max      100000.000000
Name: price, dtype: float64
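
Since `Series.between` includes both bounds by default, the minimum of the filtered prices is still `0`. A sketch of a stricter filter that excludes zero-priced listings, shown on made-up prices (the names `inclusive` and `strict` are illustrative):

```python
import pandas as pd

# Illustrative prices, including both boundary values and two outliers.
prices = pd.Series([0, 1, 480, 100000, 100001, 2147483647])

# between() keeps both bounds, so the price 0 survives.
inclusive = prices[prices.between(0, 100000)]

# An explicit boolean mask excludes 0 but keeps 100000.
strict = prices[(prices > 0) & (prices <= 100000)]

print(inclusive.tolist())
print(strict.tolist())
```

Using the strict variant on `autos` would change the row counts reported in the rest of this notebook.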

In [54]:
autos["odometer_km"].describe()

count    371528.000000
mean     125618.688228
std       40112.337051
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

The column `odometer_km` looks good as it is.

***
### Taking care of columns with date values

The columns `date_crawled`, `last_seen` and `ad_created` contain date values. Because pandas read them as string columns, we have to convert them to date values.

Let's first get a glimpse of these columns:

In [75]:
autos_filtered[["date_crawled","ad_created","last_seen"]][0:5]

| | date_crawled | ad_created | last_seen |
|---|---|---|---|
| 0 | 2016-03-24 11:52:17 | 2016-03-24 00:00:00 | 2016-04-07 03:16:57 |
| 1 | 2016-03-24 10:58:45 | 2016-03-24 00:00:00 | 2016-04-07 01:46:50 |
| 2 | 2016-03-14 12:52:21 | 2016-03-14 00:00:00 | 2016-04-05 12:47:46 |
| 3 | 2016-03-17 16:54:04 | 2016-03-17 00:00:00 | 2016-03-17 17:40:17 |
| 4 | 2016-03-31 17:25:20 | 2016-03-31 00:00:00 | 2016-04-06 10:17:21 |


In [87]:
autos_filtered["last_seen"].str[:10].value_counts(normalize=True,dropna=False).sort_index().head()

2016-03-05    0.001291
2016-03-06    0.004139
2016-03-07    0.005265
2016-03-08    0.008057
2016-03-09    0.009997
Name: last_seen, dtype: float64

These five examples from the `last_seen` column show how the day is stored in the date values (`%Y-%m-%d`). The same format applies to `date_crawled` and `ad_created`.
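
The actual conversion to date values is not shown in this notebook. One possible sketch uses `pd.to_datetime` with an explicit format; the frame `sample` reuses two rows from the preview above, and the derived `days_listed` is an illustrative addition:

```python
import pandas as pd

# Two rows copied from the preview of the date columns.
sample = pd.DataFrame({
    "date_crawled": ["2016-03-24 11:52:17", "2016-03-14 12:52:21"],
    "ad_created": ["2016-03-24 00:00:00", "2016-03-14 00:00:00"],
    "last_seen": ["2016-04-07 03:16:57", "2016-04-05 12:47:46"],
})

# Parse each string column into datetime64 values.
for col in ["date_crawled", "ad_created", "last_seen"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d %H:%M:%S")

print(sample.dtypes)

# With real datetimes, date arithmetic becomes possible,
# e.g. whole days between ad creation and last sighting.
days_listed = (sample["last_seen"] - sample["ad_created"]).dt.days
print(days_listed.tolist())
```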


Next, let's have a look at the `registration_year` data:

In [86]:
autos["registration_year"].describe()

count    371528.000000
mean       2004.577997
std          92.866598
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

Here, we can see that the maximum (`9999`) and minimum (`1000`) values of `registration_year` cannot be real data. Hence, we have to clean this column as well.

First, we count how many rows fall inside the registration window of 1900 to 2016.

In [90]:
print(autos_filtered[autos_filtered["registration_year"].between(1900,2016)].shape[0],"of former",autos_filtered.shape[0])

356388 of former 371125


Dropping the rows with `registration_year` outside the 1900-2016 window removes 14,737 of 371,125 rows (about 4%), which is an acceptable loss.

In [94]:
autos_filtered = autos_filtered[autos_filtered["registration_year"].between(1900,2016)]
print("New number of rows:",autos_filtered.shape[0])

New number of rows: 356388


***
### Aggregation of the `brand` column

In this section we want to find the mean price of the top ten best-selling brands.

First, we look at the unique values in the `brand` column and how often each of them occurs.

In [96]:
autos_filtered["brand"].value_counts()

volkswagen        75751
bmw               39112
opel              38197
mercedes_benz     34189
audi              31862
ford              24557
renault           16967
peugeot           10653
fiat               9189
seat               6646
skoda              5496
mazda              5475
smart              5032
citroen            4949
nissan             4841
toyota             4547
sonstige_autos     3713
hyundai            3507
mini               3286
volvo              3257
mitsubishi         2948
honda              2707
kia                2454
alfa_romeo         2264
suzuki             2252
porsche            2037
chevrolet          1788
chrysler           1406
dacia               874
daihatsu            780
jeep                779
land_rover          759
subaru              758
jaguar              606
trabant             577
saab                518
daewoo              513
rover               463
lancia              461
lada                218
Name: brand, dtype: int64

According to our frequency table, the following are the top ten best-selling brands:

In [99]:
top_ten_brands = autos_filtered["brand"].value_counts().head(10).index
top_ten_brands

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat'],
      dtype='object')

In the next step, we loop over the brands, calculate the mean price of each one and print the result:

In [103]:
aggregated_brand_mean_price = {}
for brand in top_ten_brands:
    sub_data = autos_filtered.loc[autos_filtered["brand"]==brand,"price"]
    curr_mean = sub_data.mean()
    aggregated_brand_mean_price[brand] = curr_mean
    print("Mean price of {} is: {:.2f}".format(brand,curr_mean))

Mean price of volkswagen is: 5222.66
Mean price of bmw is: 8165.83
Mean price of opel is: 2870.05
Mean price of mercedes_benz is: 8215.98
Mean price of audi is: 8810.40
Mean price of ford is: 3585.98
Mean price of renault is: 2360.01
Mean price of peugeot is: 3206.27
Mean price of fiat is: 2803.86
Mean price of seat is: 4397.76


Here we can see that the three brands with the highest mean price are **Audi**, **Mercedes Benz** and **BMW**.

In contrast, the two brands among the top ten with the lowest mean price are **Renault** and **Fiat**.
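
The explicit loop over brands can also be expressed with `groupby`. This is a sketch of an alternative on a hypothetical toy frame, not the approach used in this notebook:

```python
import pandas as pd

# Hypothetical toy listings; brands and prices are illustrative only.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "fiat", "fiat", "lada"],
    "price": [9000, 7000, 3000, 2000, 500],
})

# Most frequent brands (top two here, top ten in the notebook).
top_brands = toy["brand"].value_counts().head(2).index

# Mean price per brand, restricted to the most frequent brands.
mean_price = (
    toy[toy["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```

The result is a Series indexed by brand, so it can be used in place of a dictionary that is converted to a Series later.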

***
### Combination of **Mean Price** and **Mean Mileage** of the top ten best-selling brands

In [113]:
aggregated_mean_mileage = {}
for brand in top_ten_brands:
    sub_data = autos_filtered.loc[autos_filtered["brand"]==brand,"odometer_km"]
    curr_mean = sub_data.mean()
    aggregated_mean_mileage[brand] = curr_mean

mean_price_series = pd.Series(aggregated_brand_mean_price)
mean_mileage_series = pd.Series(aggregated_mean_mileage)

Aggregation_dataframe = pd.DataFrame(mean_price_series,columns=["mean_price"]).sort_values("mean_price",ascending=False)
Aggregation_dataframe["mean_mileage"] = mean_mileage_series

Aggregation_dataframe

| | mean_price | mean_mileage |
|---|---|---|
| audi | 8810.401513 | 129529.21976 |
| mercedes_benz | 8215.98248 | 130671.122291 |
| bmw | 8165.831049 | 132688.688893 |
| volkswagen | 5222.656269 | 128341.012 |
| seat | 4397.76437 | 120911.826663 |
| ford | 3585.976911 | 123621.777904 |
| peugeot | 3206.26631 | 124599.643293 |
| opel | 2870.053774 | 128756.839542 |
| fiat | 2803.857221 | 116523.560779 |
| renault | 2360.006896 | 127877.35015 |


This comparison of the `mean_price` and the `mean_mileage` of the top ten best-selling brands reveals that the three brands with the highest `mean_price` (Audi, Mercedes Benz and BMW) are also the three brands with the highest `mean_mileage`.
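
This observation can be checked programmatically. The sketch below rebuilds the aggregation table from the values shown above (rounded to two decimals) and compares the top three brands in each column; the names `summary`, `top_price` and `top_mileage` are illustrative:

```python
import pandas as pd

# Values copied from the aggregation table, rounded to two decimals.
summary = pd.DataFrame(
    {
        "mean_price": [8810.40, 8215.98, 8165.83, 5222.66, 4397.76,
                       3585.98, 3206.27, 2870.05, 2803.86, 2360.01],
        "mean_mileage": [129529.22, 130671.12, 132688.69, 128341.01, 120911.83,
                         123621.78, 124599.64, 128756.84, 116523.56, 127877.35],
    },
    index=["audi", "mercedes_benz", "bmw", "volkswagen", "seat",
           "ford", "peugeot", "opel", "fiat", "renault"],
)

# Brands with the three highest values in each column.
top_price = set(summary["mean_price"].nlargest(3).index)
top_mileage = set(summary["mean_mileage"].nlargest(3).index)
print(top_price == top_mileage)

# Pearson correlation across all ten brands, as a rough overall measure.
print(summary["mean_price"].corr(summary["mean_mileage"]))
```

The match holds for the top three only: Opel and Renault, for example, combine a low mean price with a comparatively high mean mileage.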