# Exploring eBay Car Sales Data

In this project, we will analyze a dataset of used-car listings from the eBay website.

Let's first read the data.

In [2]:
import pandas as pd
import numpy as np
autos = pd.read_csv('C:/Users/Пользователь/Desktop/data science/autos.csv', encoding='Latin-1')

In [3]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371523,2016-03-14 17:48:27,Suche_t4___vito_ab_6_sitze,privat,Angebot,2200,test,,2005,,0,,20000,1,,sonstige_autos,,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,privat,Angebot,1199,test,cabrio,2000,automatik,101,fortwo,125000,3,benzin,smart,nein,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200,test,bus,1996,manuell,102,transporter,150000,3,diesel,volkswagen,nein,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,privat,Angebot,3400,test,kombi,2002,manuell,100,golf,150000,6,diesel,volkswagen,,2016-03-20 00:00:00,0,40764,2016-03-24 12:45:21


Below is a short description of each column in the data:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - Whether the car has damage that has not yet been repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeen` - When the crawler last saw this ad online.

In [4]:
print(autos.info())
print(autos.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

Here are some things we can notice from the output above:

- Some columns have null values.
- The column names use camelCase instead of Python's preferred snake_case.
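To put a number on the first point, here is a minimal sketch that turns the non-null counts printed by `info()` into a share of missing values per column. The counts are copied from the output above; on the real dataframe, `autos.isna().mean()` should give the same shares directly.

```python
# Row count and non-null counts copied from the info() output above.
total_rows = 371_528
non_null_counts = {
    "vehicleType": 333_659,
    "gearbox": 351_319,
    "model": 351_044,
    "fuelType": 338_142,
    "notRepairedDamage": 299_468,
}

# Share of missing values per column = 1 - non_null / total.
missing_share = {
    column: 1 - count / total_rows
    for column, count in non_null_counts.items()
}

# Print the columns from most to least incomplete.
for column, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{column:<20} {share:.1%}")
```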

Some columns may also hold numbers stored as strings, which would need converting to numeric values. But before looking into that, we should convert the column names from camelCase to snake_case to improve readability.

In [5]:
columns = autos.columns
map_dict = {
    'dateCrawled' : 'date_crawled',
    'name' : 'name',
    'seller' : 'seller',
    'offerType' : 'offer_type',
    'price' : 'price',
    'abtest' : 'abtest',
    'vehicleType' : 'vehicle_type',
    'yearOfRegistration':'registration_year',
    'gearbox' : 'gear_box',
    'powerPS' : 'power_ps',
    'model' : 'model',
    'kilometer' : 'kilometer',
    'monthOfRegistration' : 'registration_month',
    'fuelType' : 'fuel_type',
    'brand' : 'brand',
    'notRepairedDamage' : 'unrepaired_damage',
    'dateCreated' : 'ad_created',
    'nrOfPictures' : 'nr_of_Pictures',
    'postalCode' : 'postal_code',
    'lastSeen' : 'last_seen'
}
autos.columns = columns.map(map_dict)
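As an alternative to typing the mapping by hand, the mechanical part of the conversion can be sketched with a regular expression. The helper name `camel_to_snake` is hypothetical and not part of pandas; it only inserts underscores and lowercases.

```python
import re

def camel_to_snake(name: str) -> str:
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# A few of the original column names from this dataset.
examples = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = {name: camel_to_snake(name) for name in examples}
print(converted)
```

Note that this only changes the casing style. Renames that reorder or reword a name, such as `yearOfRegistration` to `registration_year` or `notRepairedDamage` to `unrepaired_damage`, still need an explicit mapping like the one above.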



In [6]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gear_box,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_Pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


In [7]:
autos['price'].unique().shape


(5597,)

In [8]:
autos['price'].describe()

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64

Something is clearly off. The maximum price (about 2.1 billion) is roughly 300,000 times larger than the 75th-percentile price (7,200), so some entries must be wrong. Let's investigate which ones.
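As a quick sanity check on the summary above, the sketch below compares the maximum with the 75th percentile, using the two values printed by `describe()`. It also checks whether the maximum equals the largest signed 32-bit integer, which would hint at a placeholder or overflow value rather than a real price; that interpretation is an assumption, not something the data confirms.

```python
# Values taken from the describe() output above (max shown there as 2.147484e+09;
# the exact integer appears in the value counts of the price column).
max_price = 2_147_483_647
q75_price = 7_200

# How many times larger the maximum is than the 75th percentile.
ratio = max_price / q75_price

# True if the maximum is exactly the largest signed 32-bit integer.
is_int32_max = max_price == 2**31 - 1

print(f"max / 75th percentile = {ratio:,.0f}")
print(f"max equals 2**31 - 1: {is_int32_max}")
```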

In [9]:
autos['price'].value_counts().sort_index(ascending=False).head(15)

2147483647     1
99999999      15
99000000       1
74185296       1
32545461       1
27322222       1
14000500       1
12345678       9
11111111      10
10010011       1
10000000       8
9999999        3
3895000        1
3890000        1
2995000        1
Name: price, dtype: int64

In [10]:
autos['price'].value_counts().sort_index(ascending=True).head(15)

0     10778
1      1189
2        12
3         8
4         1
5        26
7         3
8         9
9         8
10       84
11        5
12        8
13        7
14        5
15       27
Name: price, dtype: int64

According to eBay, the most expensive vehicles sold on eBay cost around $3 million, so prices above $3 million clearly need to be removed. Looking below that threshold, though, there is a suspicious jump in prices between $350k and $1 million. So, in this project, we'll remove prices above $350k.

And since eBay auctions that start at $1 are not uncommon, we'll keep prices from $1 upward and exclude the zeroes.

In [11]:
autos = autos[autos["price"].between(1, 350000)]
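One detail worth knowing about this filter is that `Series.between` includes both end points by default. The sketch below illustrates this on a few made-up prices (not rows from the dataset), using an upper bound of 350,000 to match the $350k threshold stated in the text.

```python
import pandas as pd

# Made-up prices chosen to sit on and around the two bounds.
sample = pd.DataFrame({"price": [0, 1, 500, 350_000, 350_001, 2_147_483_647]})

# Series.between is inclusive on both ends by default.
mask = sample["price"].between(1, 350_000)
kept = sample[mask]
dropped = sample[~mask]

print("kept:   ", kept["price"].tolist())
print("dropped:", dropped["price"].tolist())
```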

In [12]:
autos['price'].describe()

count    360635.000000
mean       5898.671956
std        8866.359669
min           1.000000
25%        1250.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

In [13]:
autos['registration_year'].describe()

count    360635.000000
mean       2004.433133
std          81.016977
min        1000.000000
25%        1999.000000
50%        2004.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

We can see that many registration years were entered incorrectly. This column should hold the year the car was first registered, so it cannot be later than 2016, the year the listings were crawled. Nor is it plausible for a car to have been registered as far back as the year 1000.
<br>
So let's keep only cars registered between 1960 and 2016.
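Before dropping rows, it can help to measure how much data a range filter would discard. Here is a minimal sketch on made-up years (not taken from the dataset): the mean of a boolean mask is the share of `True` values.

```python
import pandas as pd

# Made-up registration years, including values on both boundaries.
years = pd.Series(
    [1000, 1959, 1960, 1999, 2004, 2016, 2017, 9999],
    name="registration_year",
)

# True for years inside the accepted range (both ends inclusive).
in_range = years.between(1960, 2016)

# Share of rows the filter would remove.
share_dropped = (~in_range).mean()

print(f"rows kept: {in_range.sum()}, share dropped: {share_dropped:.1%}")
```

If the share turned out to be large, that would be a signal to revisit the chosen bounds rather than silently lose the rows.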

In [14]:
autos = autos[autos["registration_year"].between(1960, 2016)]
autos["registration_year"].value_counts(normalize=True).sort_index(ascending=False).head(20)

2016    0.026616
2015    0.008430
2014    0.013773
2013    0.017677
2012    0.027029
2011    0.034644
2010    0.035427
2009    0.044758
2008    0.046309
2007    0.050557
2006    0.057775
2005    0.062742
2004    0.056238
2003    0.056622
2002    0.054352
2001    0.057021
2000    0.066776
1999    0.063626
1998    0.049749
1997    0.040516
Name: registration_year, dtype: float64

In [15]:
autos['registration_year'].describe()

count    346260.000000
mean       2002.964810
std           6.941372
min        1960.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        2016.000000
Name: registration_year, dtype: float64

In [16]:
autos['registration_year'].value_counts(normalize=True)

2000    0.066776
1999    0.063626
2005    0.062742
2006    0.057775
2001    0.057021
2003    0.056622
2004    0.056238
2002    0.054352
2007    0.050557
1998    0.049749
2008    0.046309
2009    0.044758
1997    0.040516
2010    0.035427
2011    0.034644
1996    0.030064
2012    0.027029
2016    0.026616
1995    0.025989
2013    0.017677
2014    0.013773
1994    0.013418
1993    0.009525
2015    0.008430
1992    0.008364
1991    0.007673
1990    0.007229
1989    0.003636
1988    0.002674
1985    0.002001
1987    0.001897
1986    0.001560
1980    0.001525
1983    0.001233
1984    0.001167
1970    0.000927
1982    0.000921
1978    0.000861
1979    0.000855
1981    0.000774
1972    0.000702
1974    0.000572
1973    0.000566
1971    0.000549
1977    0.000546
1976    0.000505
1960    0.000456
1966    0.000453
1969    0.000439
1975    0.000422
1968    0.000416
1967    0.000390
1965    0.000344
1964    0.000219
1963    0.000217
1961    0.000139
1962    0.000136
Name: registration_year, dtype:

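Let's move on to the `power_ps` column. It gives the power of the car in PS (German "Pferdestärke", literally 'horse strength'), also called metric horsepower. One PS is slightly smaller than one mechanical horsepower ('hp' or 'bhp'), so the two figures are close but not identical.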
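To make the relation between the two units concrete, here is a small conversion sketch. The helper `ps_to_hp` is a hypothetical name; the watt equivalents are the standard definitions of metric and mechanical horsepower.

```python
# Standard definitions of the two units in watts.
WATTS_PER_PS = 735.49875        # metric horsepower (PS)
WATTS_PER_HP = 745.69987158227  # mechanical (imperial) horsepower

def ps_to_hp(ps: float) -> float:
    """Convert metric horsepower (PS) to mechanical horsepower (hp)."""
    return ps * WATTS_PER_PS / WATTS_PER_HP

for ps in (75, 150, 500):
    print(f"{ps} PS is about {ps_to_hp(ps):.1f} hp")
```

The difference is under 2%, which is why PS and hp figures are often treated as interchangeable in rough comparisons.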

In [17]:
autos["power_ps"].value_counts().sort_index()

0        33238
1           25
2            9
3            8
4           30
         ...  
17932        1
19208        1
19211        1
19312        1
20000        1
Name: power_ps, Length: 764, dtype: int64

So here we see cars listed with 0 PS. A car whose engine produces no power makes no sense, so these zeros are most likely missing values and should be removed.
<br>
At the other end we see 20000 PS, which is just as implausible: even Formula 1 cars produce on the order of 1,000 PS, and typical everyday cars have about 120-150 hp (almost equivalent to 120-150 PS).

In [18]:
autos["power_ps"].describe()

count    346260.000000
mean        117.633567
std         186.842239
min           0.000000
25%          75.000000
50%         109.000000
75%         150.000000
max       20000.000000
Name: power_ps, dtype: float64

As expected, the 75% mark is in the normal power range for everyday cars. According to this guide, 300 PS in a normal car is still realistic. So I will set the limits at 1 PS and 300 PS, unless the most expensive cars in the data turn out to be more powerful than that.

In [26]:
autos[autos["price"] == autos["price"].max()][["name","price","power_ps","brand"]]

Unnamed: 0,name,price,power_ps,brand
56490,Porsche_991,350000,500,porsche
108590,Porsche_911_R,350000,0,porsche
120460,Andere_ISO_Grifo,350000,300,sonstige_autos
330148,Mercedes_Benz_C_63_AMG_7G_TRONIC,350000,457,mercedes_benz


So one of the most expensive cars has 500 PS! That is still plausible, unlike the far-fetched 20000 PS, so I'll set the range to between 1 and 500 PS.

In [27]:
autos = autos[autos["power_ps"].between(1,500)]
autos["power_ps"].describe()

count    312275.000000
mean        125.947088
std          60.205900
min           1.000000
25%          80.000000
50%         116.000000
75%         150.000000
max         500.000000
Name: power_ps, dtype: float64

Now let's analyze the brands.

In [28]:
brand = autos["brand"].value_counts(normalize=True)
brand

volkswagen        0.213264
bmw               0.113528
opel              0.104741
mercedes_benz     0.096963
audi              0.092614
ford              0.067466
renault           0.044759
peugeot           0.030249
fiat              0.024674
seat              0.019111
skoda             0.016287
mazda             0.015384
smart             0.014369
citroen           0.013690
nissan            0.013235
toyota            0.013113
hyundai           0.010142
mini              0.010081
volvo             0.009392
mitsubishi        0.008108
honda             0.007628
sonstige_autos    0.007170
kia               0.006968
alfa_romeo        0.006421
suzuki            0.006305
porsche           0.006267
chevrolet         0.004839
chrysler          0.003769
dacia             0.002565
jeep              0.002216
land_rover        0.002174
subaru            0.002130
daihatsu          0.001973
jaguar            0.001678
saab              0.001531
daewoo            0.001316
lancia            0.001265
r

The most frequent brands are German: Volkswagen, BMW, Opel, Mercedes-Benz and Audi. Volkswagen alone accounts for almost twice as many listings as BMW. Brands such as Lada, Trabant and Lancia, which are associated with older, retro styles, are far less common.
<br>
<br>
So let's select the top 5 car brands to aggregate by and find the average price for each of these 5 favorite brands.

In [29]:
top_5_mean_price = {}

top_5_brands = brand.index[:5]

for b in top_5_brands:
    selected_rows = autos[autos["brand"] == b]
    mean_price = selected_rows["price"].mean()
    top_5_mean_price[b] = mean_price
    
top_5_mean_price

{'volkswagen': 5668.723080619247,
 'bmw': 8665.575736206702,
 'opel': 3157.9535587623823,
 'mercedes_benz': 8628.945506786882,
 'audi': 9335.78033263027}
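The same aggregation can also be written without an explicit loop. The sketch below uses a few made-up listings (not rows from the dataset) and keeps only the top 2 brands so the result is easy to check by hand.

```python
import pandas as pd

# Made-up listings: volkswagen appears 3 times, bmw twice, opel once.
listings = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "bmw", "bmw", "opel"],
    "price": [1000, 2000, 3000, 8000, 10000, 500],
})

# value_counts sorts by frequency, so the first entries are the most common brands.
top_brands = listings["brand"].value_counts().index[:2]

# Keep only the top brands, then average the price within each brand.
mean_price = (
    listings[listings["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
)

print(mean_price.to_dict())
```

On the full dataframe, replacing `listings` with `autos` and `[:2]` with `[:5]` should produce the same five means as the loop.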

As we can see, Volkswagen and Opel are much cheaper than the other well-known brands: BMW, Mercedes-Benz and Audi.
<br>
<br>
Opel is less familiar to the rest of the world than Mercedes-Benz and Audi, but it is still a big brand. Its lower profile might help explain why its prices are so much lower than the others'.
<br>
<br>
But let's dig deeper and use the mileage to find out whether it is connected to the price.

In [32]:
top_5_mean_mileage = {}

for b in top_5_brands:
    sel_rows = autos[autos["brand"] == b]
    # The "kilometer" column holds the mileage of each listed car.
    mean_mileage = sel_rows["kilometer"].mean()
    top_5_mean_mileage[b] = mean_mileage
    
top_5_mean_mileage

{'volkswagen': 128092.9321140592,
 'bmw': 132979.8036782128,
 'opel': 128360.64571358688,
 'mercedes_benz': 130769.17995970805,
 'audi': 129139.03391998894}

I found the average mileage for each car brand, but in this form it is hard to compare with the price. Therefore, I will combine both sets of values into a single dataframe to analyze them more easily.

In [33]:

bmp_series = pd.Series(top_5_mean_price)
bmm_series = pd.Series(top_5_mean_mileage)


df = pd.DataFrame(bmp_series, columns=["mean_price"])
df["mean_mileage"] = bmm_series


df

Unnamed: 0,mean_price,mean_mileage
volkswagen,5668.723081,128092.932114
bmw,8665.575736,132979.803678
opel,3157.953559,128360.645714
mercedes_benz,8628.945507,130769.17996
audi,9335.780333,129139.03392
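The two-step construction above can also be done in a single call, since `pd.DataFrame` accepts a dictionary of dictionaries (outer keys become columns, inner keys become the index). The sketch below uses the values printed earlier in this notebook; the name `brand_summary` is only illustrative.

```python
import pandas as pd

# Mean price and mean mileage per brand, copied from the outputs above.
top_5_mean_price = {
    "volkswagen": 5668.723080619247,
    "bmw": 8665.575736206702,
    "opel": 3157.9535587623823,
    "mercedes_benz": 8628.945506786882,
    "audi": 9335.78033263027,
}
top_5_mean_mileage = {
    "volkswagen": 128092.9321140592,
    "bmw": 132979.8036782128,
    "opel": 128360.64571358688,
    "mercedes_benz": 130769.17995970805,
    "audi": 129139.03391998894,
}

# Outer keys become column names; the brand names become the index.
brand_summary = pd.DataFrame({
    "mean_price": top_5_mean_price,
    "mean_mileage": top_5_mean_mileage,
})

# Sorting by price makes the comparison with mileage easier to read.
print(brand_summary.sort_values("mean_price", ascending=False))
```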


Generally speaking, higher mileage should lead to lower prices. Here, however, the two brands with the highest mean mileage (BMW and Mercedes-Benz) are among the most expensive, while Volkswagen and Opel, with lower mean mileage, are cheaper. So mean mileage does not explain the price differences between brands in the expected way.

Strictly speaking, though, the mean mileage differs by less than 5,000 km across all five brands, so mileage can only play a small role in these brand-level price differences.
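For a rough numerical view of this comparison, the sketch below computes the spread of the mean mileage and the Pearson correlation between the two columns, using the five brand-level values from the table above. With only five data points, any coefficient should be read with caution.

```python
import pandas as pd

# Brand-level means copied from the table above (values as printed there).
brand_summary = pd.DataFrame(
    {
        "mean_price": [5668.723081, 8665.575736, 3157.953559, 8628.945507, 9335.780333],
        "mean_mileage": [128092.932114, 132979.803678, 128360.645714, 130769.17996, 129139.03392],
    },
    index=["volkswagen", "bmw", "opel", "mercedes_benz", "audi"],
)

# Difference between the highest and lowest mean mileage, in km.
mileage_spread_km = brand_summary["mean_mileage"].max() - brand_summary["mean_mileage"].min()

# Pearson correlation between mean price and mean mileage across the five brands.
r = brand_summary["mean_price"].corr(brand_summary["mean_mileage"])

print(f"mileage spread: {mileage_spread_km:,.0f} km")
print(f"Pearson r (n=5): {r:.2f}")
```

The coefficient comes out positive, meaning the brands with higher mean mileage tend to be the pricier ones. That is the opposite of what depreciation alone would suggest, which supports the reading that brand, not mileage, drives these averages.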

# Conclusion

All in all, the most popular car brands listed on the eBay website are German. Mercedes-Benz, BMW, Opel, Volkswagen and Audi are the most popular ones, and among them Volkswagen - the "People's Car" - prevails, accounting for over 20% of all listings on its own.

There is little to no connection between the average price and the average mileage of the five most-listed brands. The relatively low price of the "People's Car" Volkswagen does stand out, though, which helps explain its popularity.

