# Exploring eBay Car Sales Data

The aim of this project is to clean the data and analyze the included used car listings.

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.



In [1]:
# Let's start by importing the libraries we need and reading the dataset into pandas.
import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding='Latin-1') # Reading with the default UTF-8 encoding raises an error, so we use Latin-1 (Windows-1252 is another common option).
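
The encoding choice above can be sketched as a small fallback loop. This is only an illustration and not part of the original notebook: the helper name `read_with_fallback`, the list of encodings to try, and the tiny temporary file are all assumptions made for the example.

```python
import os
import tempfile

import pandas as pd


def read_with_fallback(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Hypothetical helper: return (DataFrame, encoding) for the first encoding that decodes the file."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Build a tiny Latin-1 encoded file; the "Ü" byte is not valid UTF-8 on its own.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name,price\nFord_Focus_TÜV_neu,1350\n".encode("latin-1"))
    sample_path = tmp.name

sample, used_encoding = read_with_fallback(sample_path)
os.remove(sample_path)
print(used_encoding)
```

Latin-1 can decode any byte sequence, so it never raises; whether the decoded characters are the intended ones still has to be checked by eye.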






In [2]:
autos # A neat feature of Jupyter Notebook: it renders the first few and last few rows of any pandas object.

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


## Initial Exploration

**USEFUL TOOLS:**

**DataFrame.info()** gives:
- The total number of rows in the DataFrame.
- The index range of the DataFrame (start and stop values).
- The column names and their corresponding data types.
- The number of non-null values in each column.
- The memory usage of the DataFrame.

**DataFrame.head():** prints the first few rows (five by default).

**DataFrame.describe():** statistical measures for numeric and categorical (include='all') columns, such as count, mean, standard deviation, minimum, maximum, and quartiles.

**If any columns need a closer look:**

- **Series.head()**
- **Series.value_counts()**: counts of unique values.
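
As a quick illustration of these tools, here is a minimal sketch on a toy DataFrame. The values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with made-up values, mimicking two columns of the dataset.
toy = pd.DataFrame({
    "brand": ["bmw", "ford", "bmw", None],
    "yearOfRegistration": [1997, 2003, 2009, 2004],
})

toy.info()                           # dtypes, non-null counts, memory usage
print(toy.head(2))                   # first two rows
print(toy.describe(include="all"))   # numeric and categorical summaries

# dropna=False also counts the missing brand.
brand_counts = toy["brand"].value_counts(dropna=False)
print(brand_counts)
```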

In [3]:
autos.info()
print('\n')
autos.head() # gives the first five rows by default.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


**Column description:**
- dateCrawled : when this ad was first crawled, all field-values are taken from this date
- name : "name" of the car
- seller : private or dealer
- offerType
- price : the price on the ad to sell the car
- abtest
- vehicleType
- yearOfRegistration : at which year the car was first registered
- gearbox
- powerPS : power of the car in PS
- model
- odometer : how many kilometers the car has driven
- monthOfRegistration : at which month the car was first registered
- fuelType
- brand
- notRepairedDamage : if the car has a damage which is not repaired yet
- dateCreated : the date for which the ad at ebay was created
- nrOfPictures : number of pictures in the ad
- postalCode
- lastSeen : when the crawler saw this ad last online



In [4]:
autos.describe(include='all') # statistical measures for numeric and categorical (include='all') columns.

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 15:49:30,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


**Exploring individual columns**


In [5]:
print(autos['nrOfPictures'].value_counts())

0    50000
Name: nrOfPictures, dtype: int64


**Initial data observations:**

- vehicleType, gearbox, model, fuelType and notRepairedDamage all have some null values.

- The *unique* row from DataFrame.describe() shows several columns with just 2 unique values: seller, offerType, abtest, gearbox and notRepairedDamage.

- 15 columns hold string values.

- price and odometer should be converted to numbers so they are easier to work with.

- The column names should follow one consistent style (for example all lowercase) to make the data easier to work with.
- The 'nrOfPictures' column contains only zeros, so it carries no useful information.
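
One way to make the column names consistent is to convert them from camelCase to snake_case. The sketch below is an assumption about how that could be done, not a step the notebook performs; `to_snake_case` is a hypothetical helper.

```python
import re


def to_snake_case(name):
    """Insert "_" before a capital that follows a lowercase letter or digit, then lowercase everything."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


columns = ["dateCrawled", "offerType", "yearOfRegistration",
           "powerPS", "notRepairedDamage", "abtest"]
snake_columns = [to_snake_case(c) for c in columns]
print(snake_columns)
```

Applied to the real data, this would amount to `autos.columns = [to_snake_case(c) for c in autos.columns]`. The rest of the notebook keeps the original names.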

**Cleaning price and odometer columns and converting them to numerical**

In [6]:
autos['price'] = autos['price'].str.replace('$','', regex=False).str.replace(',','', regex=False).astype(int)
print(autos['price'])

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 50000, dtype: int64


In [7]:
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)
print(autos['odometer'])

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer, Length: 50000, dtype: int64


In [8]:
autos = autos.rename({'odometer':'odometer_km'},axis=1)
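
The two cleaning steps can be checked on a few sample strings. This is a minimal sketch with made-up inputs. Passing `regex=False` is a precaution: depending on the pandas version, `str.replace` may treat the pattern as a regular expression, in which `$` is the end-of-string anchor rather than a literal character.

```python
import pandas as pd

raw_price = pd.Series(["$5,000", "$8,500", "$0"])
raw_odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

# regex=False makes every pattern a literal string.
price = (raw_price.str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))
odometer_km = (raw_odometer.str.replace("km", "", regex=False)
                           .str.replace(",", "", regex=False)
                           .astype(int))
print(price.tolist(), odometer_km.tolist())
```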

**We will continue exploring the odometer and price columns: Removing Outliers**

In [9]:
print(autos.sort_values('price').loc[52:50,'price'])

# Reminder: .loc selects by index label, not by position (each row keeps its original label).
# After sorting by 'price', the row labelled 52 sits above the row labelled 50, so .loc[52:50]
# returns every row from label 52 down to label 50 in the sorted order.
# .loc[50:52] would not return these rows, because label 50 comes after label 52 in the sorted frame.

52       3500
32599    3500
11709    3500
49735    3500
13789    3500
         ... 
34295    5999
11360    5999
18238    5999
48715    5999
50       5999
Name: price, Length: 7595, dtype: int64


In [10]:
#print(autos)

In [11]:
#pd.set_option('display.max_rows', None)

In [12]:
print(autos.sort_values('price').iloc[50:51, 1]) # .iloc selects by position rather than by index label:
# here the row at position 50 and the column at position 1 ('name').

15203    Opel_Zafira_OPC_LET/LEH_Turbo
Name: name, dtype: object
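
The difference between label-based and position-based selection after sorting can be reproduced on a toy frame. The values below are made up, and the sketch assumes a unique integer index like the one in `autos`.

```python
import pandas as pd

toy = pd.DataFrame({"price": [30, 10, 20, 40]})   # index labels 0, 1, 2, 3
by_price = toy.sort_values("price")               # label order is now 1, 2, 0, 3

# .loc slices by label, following the current row order: from label 2 to label 3.
by_label = by_price.loc[2:3, "price"]

# Label 0 sits below label 1 after sorting, so this label slice selects nothing.
reversed_labels = by_price.loc[0:1, "price"]

# .iloc slices by position: the two cheapest rows, whatever their labels are.
by_position = by_price.iloc[0:2, 0]

print(by_label.tolist(), reversed_labels.tolist(), by_position.tolist())
```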


In [13]:
print(autos['price'].value_counts()) # in the output the first column represents the index of the series, 
#which is also the unique 'price' values in the 'autos' dataframe.

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
20790       1
8970        1
846         1
2895        1
33980       1
Name: price, Length: 2357, dtype: int64


In [14]:
print(autos['price'].value_counts().sort_index(ascending=True)) # now we sort the indexes that correspond 
#to the unique price values too.

0           1421
1            156
2              3
3              1
5              2
            ... 
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price, Length: 2357, dtype: int64


In [15]:
print(autos['price'].value_counts().sort_index(ascending=True).tail(20))

197000      1
198000      1
220000      1
250000      1
259000      1
265000      1
295000      1
299000      1
345000      1
350000      1
999990      1
999999      2
1234566     1
1300000     1
3890000     1
10000000    1
11111111    2
12345678    3
27322222    1
99999999    1
Name: price, dtype: int64


So as not to discard genuinely expensive cars from brands like Ferrari, we will treat only prices of 10000000 or more as high-end outliers and remove those rows.

In [16]:
print(autos[autos['price'] < 100]['price'].value_counts().sort_index(ascending=True))
print(autos['price'].dtype)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
40       6
45       4
47       1
49       4
50      49
55       2
59       1
60       9
65       5
66       1
70      10
75       5
79       1
80      15
89       1
90       5
99      19
Name: price, dtype: int64
int64


In [17]:
print(autos[(autos['price'] > 0) & (autos['price'] < 100)]['price'].describe())

count    341.000000
mean      29.105572
std       32.610455
min        1.000000
25%        1.000000
50%       10.000000
75%       50.000000
max       99.000000
Name: price, dtype: float64


Some prices are implausibly low; a car listed for less than about 100 dollars seems unrealistic. For now we will only remove the rows at the very bottom of the price range (the filter applied later uses 99 as its lower bound).

What about the odometer column?

In [18]:
autos.info() # In order to check the name of the columns. 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  int64 
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer_km          50000 non-null  int64 
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [19]:
print(autos['odometer_km'].unique())

[150000  70000  50000  80000  10000  30000 125000  90000  20000  60000
   5000 100000  40000]


In [20]:
print(autos['odometer_km'].unique().shape)

(13,)


In [21]:
print(autos['odometer_km'].describe())

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


In [22]:
print(autos['odometer_km'].value_counts())

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64


In [23]:
print(autos['odometer_km'].value_counts().sort_index(ascending=True))

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64


The odometer_km column doesn't appear to have any outliers.

## Removing outliers:

We will keep only the prices between 99 and 9999999 (both bounds included):

In [24]:
# autos = autos[(autos['price'] >= 99) & (autos['price'] <= 9999999)] would give the same result,
# but we will use the between() method instead.

autos = autos[autos['price'].between(99,9999999)] 
# between() checks whether each 'price' value lies between 99 and 9999999 (both bounds inclusive).
# It returns a boolean Series, which is used to filter the 'autos' DataFrame.
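
Because `between()` includes both bounds by default, it is not interchangeable with strict `>` and `<` comparisons. A minimal sketch on made-up prices shows the difference at the edges:

```python
import pandas as pd

prices = pd.Series([0, 98, 99, 100, 9999999, 10000000])

inclusive_mask = prices.between(99, 9999999)        # both endpoints are kept
strict_mask = (prices > 99) & (prices < 9999999)    # both endpoints are dropped

kept_inclusive = prices[inclusive_mask].tolist()
kept_strict = prices[strict_mask].tolist()
print(kept_inclusive)
print(kept_strict)
```

In the notebook this matters for the listings priced at exactly 99, which stay in the data.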

In [25]:
list_price = list(autos['price'].unique()) # Convert the unique() output into a list so that all values can be displayed.
list_price.sort(reverse=True) # Sort the values in descending order so they are easy to check.
print(list_price)



[3890000, 1300000, 1234566, 999999, 999990, 350000, 345000, 299000, 295000, 265000, 259000, 250000, 220000, 198000, 197000, 194000, 190000, 180000, 175000, 169999, 169000, 163991, 163500, 155000, 151990, 145000, 139997, 137999, 135000, 130000, 129000, 128000, 120000, 119900, 119500, 116000, 115991, 115000, 114400, 109999, 105000, 104900, 99900, 99000, 98500, 94999, 93911, 93000, 89900, 89000, 88900, 86500, 85000, 84997, 84000, 83000, 82987, 80000, 79999, 79980, 79933, 79500, 78911, 76997, 75997, 75900, 75000, 74999, 74900, 73996, 73900, 73500, 72900, 72600, 72500, 71000, 70850, 70000, 69999, 69997, 69993, 69900, 69500, 68900, 68750, 68500, 68300, 68000, 67911, 67000, 66964, 66500, 65990, 65700, 65699, 65000, 64999, 64990, 64900, 64600, 64500, 64280, 63999, 63499, 63000, 62900, 62000, 61999, 61950, 61900, 61500, 60000, 59850, 59500, 59000, 58900, 58700, 58500, 57800, 56900, 56800, 56500, 56000, 55999, 55996, 55900, 55800, 55555, 55500, 55000, 54990, 54500, 53900, 53500, 53000, 52911, 52

In [26]:
print(autos['price'].value_counts().sort_index(ascending=True).tail(20))

169999     1
175000     1
180000     1
190000     1
194000     1
197000     1
198000     1
220000     1
250000     1
259000     1
265000     1
295000     1
299000     1
345000     1
350000     1
999990     1
999999     2
1234566    1
1300000    1
3890000    1
Name: price, dtype: int64


The highest price is now 3890000, which makes sense because it corresponds to the Ferrari brand. Three values look suspicious (999990, 999999 and 1234566), so we will check them one by one:

In [27]:
print(autos[autos['price'] == 999990]['name'])
print('\n')
print(autos[autos['price'] == 999999]['name'])
print('\n')
print(autos[autos['price'] == 1234566]['name'])

37585    Volkswagen_Jetta_GT
Name: name, dtype: object


514      Ford_Focus_Turnier_1.6_16V_Style
43049                       2_VW_Busse_T3
Name: name, dtype: object


22947    Bmw_530d_zum_ausschlachten
Name: name, dtype: object


Such high prices do not match these brands and models. For that reason I will remove these four rows (the price 999999 appears twice):

In [28]:
autos = autos[(autos['price'] != 999990) & (autos['price'] != 1234566) & (autos['price'] != 999999)]
# Another approach: autos = autos[(autos['price'] < 999990) | (autos['price'] > 1234566)]
# The | symbol means 'or'. This works here because no other prices fall between 999990 and 1234566.
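
When several specific values have to be dropped, `Series.isin()` combined with `~` (negation) scales better than chaining `!=` comparisons. The sketch below uses a made-up frame and is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

listings = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "price": [3500, 999990, 999999, 1234566, 1300000],
})
suspicious_prices = [999990, 999999, 1234566]

# ~ negates the boolean mask, so only rows whose price is NOT in the list remain.
cleaned = listings[~listings["price"].isin(suspicious_prices)]
print(cleaned)
```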


In [29]:
print(autos['price'].value_counts().sort_index(ascending=True).tail(20))

163500     1
163991     1
169000     1
169999     1
175000     1
180000     1
190000     1
194000     1
197000     1
198000     1
220000     1
250000     1
259000     1
265000     1
295000     1
299000     1
345000     1
350000     1
1300000    1
3890000    1
Name: price, dtype: int64


In [30]:
print(autos['price'].value_counts().sort_index(ascending=True).head(20))

99      19
100    134
110      3
111      2
115      2
117      1
120     39
122      1
125      8
129      1
130     15
135      1
139      1
140      9
145      2
149      7
150    224
156      2
160      8
170      7
Name: price, dtype: int64


Interestingly, 224 cars are listed at a very low price of 150 USD. As noted earlier, for now we keep every listing priced at 99 USD or more.

In [31]:
print(autos['price'].describe())

count    4.824500e+04
mean     6.035405e+03
std      2.073154e+04
min      9.900000e+01
25%      1.250000e+03
50%      3.000000e+03
75%      7.499000e+03
max      3.890000e+06
Name: price, dtype: float64


## After removing the price outliers:

We now have a dataframe whose prices range from 99 to 3890000 dollars.
75% of the cars are priced at 7499 dollars or less.
25% of the cars are priced at 1250 dollars or less.

## Exploring the date columns
There are 5 columns that represent different date information.

- `dateCrawled`: added by the crawler
- `lastSeen`: added by the crawler
- `dateCreated`: from the website
- `monthOfRegistration`: from the website
- `yearOfRegistration`: from the website

In [32]:
print('dateCrawled: ',autos['dateCrawled'].dtype)
print('lastSeen: ',autos['lastSeen'].dtype)
print('dateCreated: ',autos['dateCreated'].dtype)
print('monthOfRegistration: ',autos['monthOfRegistration'].dtype)
print('yearOfRegistration: ',autos['yearOfRegistration'].dtype)


dateCrawled:  object
lastSeen:  object
dateCreated:  object
monthOfRegistration:  int64
yearOfRegistration:  int64


We observe that three of the date columns contain values of string type. To facilitate data manipulation, we intend to convert them into integer format.

First, we are going to print a few rows of each column to examine their structure.

In [33]:
print(autos['dateCrawled'].head(5))
print('\n')
print(autos['lastSeen'].head(5))
print('\n')
print(autos['dateCreated'].head(5))
print('\n')
print(autos['monthOfRegistration'].head(5))
print('\n')
print(autos['yearOfRegistration'].head(5))

0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
2    2016-03-26 18:57:24
3    2016-03-12 16:58:10
4    2016-04-01 14:38:50
Name: dateCrawled, dtype: object


0    2016-04-06 06:45:54
1    2016-04-06 14:45:08
2    2016-04-06 20:15:37
3    2016-03-15 03:16:28
4    2016-04-01 14:38:50
Name: lastSeen, dtype: object


0    2016-03-26 00:00:00
1    2016-04-04 00:00:00
2    2016-03-26 00:00:00
3    2016-03-12 00:00:00
4    2016-04-01 00:00:00
Name: dateCreated, dtype: object


0    3
1    6
2    7
3    6
4    7
Name: monthOfRegistration, dtype: int64


0    2004
1    1997
2    2009
3    2007
4    2003
Name: yearOfRegistration, dtype: int64


We can also examine the three string columns side by side:

In [34]:
autos[['dateCrawled','lastSeen','dateCreated']][0:5]

Unnamed: 0,dateCrawled,lastSeen,dateCreated
0,2016-03-26 17:47:46,2016-04-06 06:45:54,2016-03-26 00:00:00
1,2016-04-04 13:38:56,2016-04-06 14:45:08,2016-04-04 00:00:00
2,2016-03-26 18:57:24,2016-04-06 20:15:37,2016-03-26 00:00:00
3,2016-03-12 16:58:10,2016-03-15 03:16:28,2016-03-12 00:00:00
4,2016-04-01 14:38:50,2016-04-01 14:38:50,2016-04-01 00:00:00


We observe that the three columns have the same structure:
Year-month-day hour:minute:seconds.
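
Given this fixed structure, the date part can be turned into an integer such as 20160326, or the whole string can be parsed into a proper datetime. Both options are sketched below on two sample timestamps. Which one suits the later analysis is a design choice, and the notebook itself goes on to work with the string prefix.

```python
import pandas as pd

stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Option 1: keep the first 10 characters (YYYY-MM-DD), drop the dashes, cast to int.
as_int = stamps.str[:10].str.replace("-", "", regex=False).astype(int)

# Option 2: parse into real datetimes, which support date arithmetic and resampling.
as_datetime = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

print(as_int.tolist())
print(as_datetime.dt.date.tolist())
```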

**We are going to check the date distributions of the three columns:**

In [35]:
autos['dateCrawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index() # .str[:10] keeps only the first 10 characters (the date part).
# After value_counts() the dates become the index of the Series, so sort_index() orders the result by date.
# normalize=True returns relative frequencies instead of counts, and dropna=False includes null values so we can check their share in the column.

2016-03-05    0.025371
2016-03-06    0.014033
2016-03-07    0.036045
2016-03-08    0.033206
2016-03-09    0.032998
2016-03-10    0.032273
2016-03-11    0.032604
2016-03-12    0.036895
2016-03-13    0.015670
2016-03-14    0.036646
2016-03-15    0.034304
2016-03-16    0.029454
2016-03-17    0.031527
2016-03-18    0.012893
2016-03-19    0.034760
2016-03-20    0.037786
2016-03-21    0.037227
2016-03-22    0.032915
2016-03-23    0.032294
2016-03-24    0.029454
2016-03-25    0.031506
2016-03-26    0.032314
2016-03-27    0.031112
2016-03-28    0.034947
2016-03-29    0.034097
2016-03-30    0.033724
2016-03-31    0.031858
2016-04-01    0.033703
2016-04-02    0.035589
2016-04-03    0.038595
2016-04-04    0.036584
2016-04-05    0.013058
2016-04-06    0.003171
2016-04-07    0.001389
Name: dateCrawled, dtype: float64

In [36]:
autos['lastSeen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001078
2016-03-06    0.004311
2016-03-07    0.005431
2016-03-08    0.007317
2016-03-09    0.009597
2016-03-10    0.010633
2016-03-11    0.012395
2016-03-12    0.023774
2016-03-13    0.008871
2016-03-14    0.012623
2016-03-15    0.015877
2016-03-16    0.016437
2016-03-17    0.028086
2016-03-18    0.007317
2016-03-19    0.015794
2016-03-20    0.020665
2016-03-21    0.020562
2016-03-22    0.021349
2016-03-23    0.018593
2016-03-24    0.019774
2016-03-25    0.019132
2016-03-26    0.016665
2016-03-27    0.015546
2016-03-28    0.020852
2016-03-29    0.022282
2016-03-30    0.024707
2016-03-31    0.023816
2016-04-01    0.022842
2016-04-02    0.024894
2016-04-03    0.025122
2016-04-04    0.024541
2016-04-05    0.125091
2016-04-06    0.221930
2016-04-07    0.132097
Name: lastSeen, dtype: float64

In [37]:
autos['dateCreated'].str[:10].value_counts(normalize=True, dropna=False).head(40)

2016-04-03    0.038843
2016-03-20    0.037848
2016-03-21    0.037455
2016-04-04    0.036936
2016-03-12    0.036729
2016-04-02    0.035278
2016-03-14    0.035278
2016-03-28    0.035050
2016-03-07    0.034781
2016-03-29    0.034055
2016-03-15    0.034035
2016-04-01    0.033682
2016-03-19    0.033641
2016-03-30    0.033537
2016-03-08    0.033206
2016-03-09    0.033081
2016-03-11    0.032915
2016-03-22    0.032729
2016-03-26    0.032376
2016-03-23    0.032128
2016-03-10    0.031983
2016-03-31    0.031900
2016-03-25    0.031630
2016-03-17    0.031195
2016-03-27    0.031029
2016-03-16    0.029951
2016-03-24    0.029392
2016-03-05    0.022925
2016-03-13    0.017038
2016-03-06    0.015297
2016-03-18    0.013577
2016-04-05    0.011794
2016-04-06    0.003254
2016-03-04    0.001492
2016-04-07    0.001244
2016-03-03    0.000871
2016-02-28    0.000207
2016-02-29    0.000166
2016-02-27    0.000124
2016-03-02    0.000104
Name: dateCreated, dtype: float64

**First observations of the 'dateCrawled', 'lastSeen' and 'dateCreated' columns:**

- 'dateCrawled': the data was crawled from 2016-03-05 to 2016-04-07.
- 'lastSeen': about 48% of the ads were last seen between 2016-04-05 and 2016-04-07. The range runs from 2016-03-05 to 2016-04-07, the same as for 'dateCrawled'.
- 'dateCreated': the ads were created from 2015-06-11 to 2016-04-07 (the output above lists only the 40 most frequent dates).
- There are no null values.
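
The share of ads last seen within a date window can be computed directly instead of adding the values by hand. The sketch below uses four made-up timestamps; it relies on the fact that ISO-formatted dates sort chronologically as strings.

```python
import pandas as pd

last_seen = pd.Series([
    "2016-03-12 10:00:00", "2016-04-05 08:00:00",
    "2016-04-06 09:30:00", "2016-04-07 06:17:27",
])

shares = last_seen.str[:10].value_counts(normalize=True, dropna=False).sort_index()

# With a sorted index, a label slice selects the window (both ends included).
final_days_share = shares.loc["2016-04-05":"2016-04-07"].sum()
print(final_days_share)
```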

In [38]:
autos['yearOfRegistration'].value_counts().sort_index().head(40)

1000     1
1001     1
1111     1
1800     2
1910     2
1927     1
1929     1
1931     1
1934     2
1937     4
1938     1
1939     1
1941     2
1943     1
1948     1
1950     1
1951     2
1952     1
1953     1
1954     2
1955     2
1956     4
1957     2
1958     4
1959     6
1960    22
1961     6
1962     4
1963     8
1964    12
1965    17
1966    22
1967    26
1968    26
1969    19
1970    37
1971    26
1972    33
1973    23
1974    24
Name: yearOfRegistration, dtype: int64

In [39]:
autos['yearOfRegistration'].value_counts().sort_index().tail(40)

1990     332
1991     338
1992     368
1993     420
1994     627
1995    1194
1996    1357
1997    1929
1998    2344
1999    2879
2000    3105
2001    2629
2002    2477
2003    2694
2004    2699
2005    2912
2006    2669
2007    2273
2008    2210
2009    2080
2010    1587
2011    1618
2012    1309
2013     801
2014     662
2015     381
2016    1203
2017    1384
2018     468
2019       2
2800       1
4100       1
4500       1
4800       1
5000       3
5911       1
6200       1
8888       1
9000       1
9999       3
Name: yearOfRegistration, dtype: int64

In [40]:
autos[autos['yearOfRegistration'] == 1000]['yearOfRegistration']

22316    1000
Name: yearOfRegistration, dtype: int64

In [41]:
autos['yearOfRegistration'].describe()

count    48245.000000
mean      2004.729464
std         87.878422
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: yearOfRegistration, dtype: float64

The 'yearOfRegistration' column contains registration years that do not correspond to valid years for car registration. The dataframe includes registration years prior to the invention of the first car in 1886, as well as years ranging from 2800 to 9999. The following is a list of these invalid years, along with the number of rows in the dataframe that have each corresponding registration year value:

- 1000 --> 1 row
- 1001 --> 1 row
- 1111 --> 1 row
- 1800 --> 2 rows
- 2800 --> 1 row
- 4100 --> 1 row
- 4500 --> 1 row
- 4800 --> 1 row
- 5000 --> 3 rows
- 5911 --> 1 row
- 6200 --> 1 row
- 8888 --> 1 row
- 9000 --> 1 row
- 9999 --> 3 rows


**We will consider the years between 1910 and 2019 as valid for the 'yearOfRegistration' column:**

In [42]:
autos = autos[(autos['yearOfRegistration'] >= 1910) & (autos['yearOfRegistration'] <= 2019)]
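
Before dropping rows like this, it can be useful to measure how many fall outside the accepted range. A minimal sketch on made-up years, using the same 1910 to 2019 bounds:

```python
import pandas as pd

years = pd.Series([1000, 1910, 1999, 2016, 2019, 9999])

valid = years.between(1910, 2019)     # inclusive on both ends
n_invalid = int((~valid).sum())       # number of rows that would be dropped
share_invalid = (~valid).mean()       # the same, as a fraction of all rows
print(n_invalid, share_invalid)
```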

In [43]:
autos['yearOfRegistration'].value_counts().sort_index().head(10)

1910    2
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
Name: yearOfRegistration, dtype: int64

In [44]:
autos['yearOfRegistration'].value_counts().sort_index().tail(10)

2010    1587
2011    1618
2012    1309
2013     801
2014     662
2015     381
2016    1203
2017    1384
2018     468
2019       2
Name: yearOfRegistration, dtype: int64

In [45]:
autos['yearOfRegistration'].describe()

count    48226.000000
mean      2003.489093
std          7.511783
min       1910.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       2019.000000
Name: yearOfRegistration, dtype: float64

We now have a dataframe of 48226 rows, with registration years from 1910 to 2019. Three quarters of the cars were registered in 2008 or earlier, and the middle half was registered between 1999 and 2008.

In [46]:
print('Brands: ',autos['brand'].unique())
print('\n')
print('Number of brands: ',len(autos['brand'].unique()))

Brands:  ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia']


Number of brands:  40


'sonstige_autos' means 'other cars' in English. We are going to check the names and prices listed under this specific 'brand':

In [47]:
autos[autos['brand'] == 'sonstige_autos']['name'].unique()

array(['Corvette_C3_Coupe_T_Top_Crossfire_Injection',
       'Ssangyong_Actyon_SUV_2.0_xdi2wd_55000_km', 'Ssanyong_Rexton_2.7',
       'MG_MGB_GT', '3_Fahrzeug_zu_Verkaufen',
       'Dodge_Nitro_4.0_Automatik_R/T', 'Proton_PKW',
       'Gut_erhaltene_Alufelgen', 'Wartburg_1.3',
       'Werkaufen_meine_Iveco_deily', 'Pontiac_Firebird',
       'Cadillac_SRX_3.6_V6_AWD',
       'Pontiac_Firebird_3.4___org._30.000_km___LPG',
       'Pontiac_Firebird_2_8L_V6_Targa__Tuev_07/2016',
       'Andere_russ._LuAZ_967M_Schwimmwagen_NVA_GSSD_aehnl_GAZ46',
       'Buick_Riviera', 'Wartburg_353',
       'Dodge__RAM_Laramie_1500_V8_HEMI_Crew_Cab_AHK',
       '1936_Daimler_Fifteen_Special_Projekt',
       'Dodge_RAM_B300_Mowag_Mowag_Dodge_V8_selten_Neuaufbau_US_Car_Van',
       'DKW_1000_S_Coupe', 'VW_SHARAN_TDI_1.9_TOP_ZUSTAND', 'MG_MGF_1.8i',
       'MICROCAR_M.GO_Dynamic_Mopedauto_Leichtkraftfahrzeug_NEU',
       'Cadillac_Deville',
       'Verkaufe_unser_Wohnmobilbegleitfahrzeug_mit_Transportanhaenge

We see that 'sonstige_autos' covers many different brands, such as Tesla, Ferrari, Maserati and Cadillac.

First we will count how often each 'name' appears within 'sonstige_autos' (a name roughly corresponds to a specific model of a brand), and then we will check the lowest and highest prices:

In [48]:
autos[autos['brand'] == 'sonstige_autos']['name'].value_counts(dropna=False).head(40)

Dodge_RAM                                              6
MG_MGF_1.8i                                            5
Hummer_H2                                              4
Dodge_Nitro                                            3
Abarth_Grande_Punto                                    3
Cadillac_Deville                                       3
Suche_ein_Auto                                         3
Wartburg_353_W                                         2
NSU_Andere                                             2
Triumph_TR3_A                                          2
Lexus_IS_250_Sport_Line                                2
MG_MGB                                                 2
Plymouth_Andere                                        2
Dodge_Caliber_2.0_CVT_SXT                              2
Piaggio_Porter                                         2
Pontiac_Bonneville                                     2
Lexus_IS_220d_DPNR_Luxury_Line                         2
Abarth_500                     

In [49]:
autos[autos['brand'] == 'sonstige_autos']['name'].value_counts(dropna=False).tail(20)

Wir_kaufen_Ihren_Gebrauchten_fuer_einen_sehr_guten_Preis    1
Lexus_IS_200_Sport_Standheizung_Top                         1
VW_SHARAN_TDI_1.9_TOP_ZUSTAND                               1
Iveco_35_S_13_V_L_Cool_TÜV_NEU                              1
Rohkaraosserie_MG_B_GT                                      1
MASERATI_QUATTROPORTE                                       1
Verkaufe_Polo_Farbe_rot                                     1
Ich_besorge_Ihnen_ihr_naechstes_Auto                        1
Suche_PKW_mit_1_Jahr_TÜV                                    1
Abarth_500_Esseesse                                         1
Suche_Kleinwagen_bis_Maximal_1200                          1
Corvette_C4                                                 1
Wartburg_Wartburg_353__2_Takter__Oldtimergutachten_         1
Gmc_vandura                                                 1
Maserati_Biturbo                                            1
Lexus_CT_200h                                               1
Ferrari_

In [50]:
autos[autos['brand'] == 'sonstige_autos']['price'].sort_values().head(20)

35325    100
461      100
41053    130
36282    140
11283    199
19425    200
15996    200
13543    250
47444    250
20420    250
359      299
39143    300
11893    300
35591    300
4562     300
41537    350
31295    350
24420    370
8598     400
18541    400
Name: price, dtype: int64

In [51]:
autos[autos['brand'] == 'sonstige_autos']['price'].sort_values().tail(20) # We check the highest prices of the brand with the most expensive cars: 'sonstige_autos'

29237      46900
27481      47000
33189      48700
34589      48850
2698       54500
40975      55000
30121      59850
16006      60000
49941      62000
4045       72600
47406      79500
8446       79999
3283       80000
16964     105000
49391     109999
22060     114400
28090     194000
14715     345000
7814     1300000
47634    3890000
Name: price, dtype: int64

The span of prices in this 'sonstige_autos' group is striking: it ranges from 100 USD to 3,890,000 USD.

In [52]:
print(autos[autos['price'] == 1300000]['name'])
print(autos[autos['price'] == 3890000]['name'])


7814    Ferrari_F40
Name: name, dtype: object
47634    Ferrari_FXX
Name: name, dtype: object


*The highest prices make sense: both listings are Ferrari models.*
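The two lookups above can also be done in one step. The following is a minimal sketch on a toy DataFrame (the `toy` frame is an assumption for illustration; the two Ferrari rows reuse the names and prices shown above, and the third row is made up) that uses `DataFrame.nlargest` to return names and prices together:

```python
import pandas as pd

# Toy stand-in for the autos dataframe; only the columns needed here.
toy = pd.DataFrame({
    'name': ['Ferrari_F40', 'Ferrari_FXX', 'Dodge_RAM'],
    'brand': ['sonstige_autos', 'sonstige_autos', 'sonstige_autos'],
    'price': [1300000, 3890000, 100],  # last price is illustrative only
})

# Filter to the brand, then keep the 2 rows with the largest prices.
others = toy[toy['brand'] == 'sonstige_autos']
top = others.nlargest(2, 'price')[['name', 'price']]
print(top)
```

With the real data, the same pattern would presumably be applied to `autos` instead of `toy`.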

**Below, we check the share of listings for each brand in the 'brand' column (as a fraction of the dataset):**

In [53]:
autos['brand'].value_counts(normalize=True, dropna=False)

volkswagen        0.212976
bmw               0.108759
opel              0.108344
mercedes_benz     0.095923
audi              0.085991
ford              0.069589
renault           0.047858
peugeot           0.029445
fiat              0.026044
seat              0.018911
skoda             0.016070
nissan            0.015324
mazda             0.015261
smart             0.014308
citroen           0.014142
toyota            0.012670
hyundai           0.009953
sonstige_autos    0.009414
volvo             0.009020
mini              0.008647
mitsubishi        0.008128
honda             0.008025
kia               0.007112
alfa_romeo        0.006635
suzuki            0.005889
porsche           0.005806
chevrolet         0.005640
chrysler          0.003484
dacia             0.002675
daihatsu          0.002509
jeep              0.002219
subaru            0.002074
land_rover        0.002053
saab              0.001638
daewoo            0.001555
jaguar            0.001493
rover             0.001348
t

**Mean prices of the brands in the dataframe**

Even though some brands account for a small share of the listings, the dataset contains only 40 brands, so we don't need to select a subset and can work with all of them.
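If a subset were wanted instead, one possible approach is to keep only the brands whose share exceeds a cut-off. This is a sketch on toy data; `brand_col` and the `min_share` threshold are hypothetical and not part of this analysis:

```python
import pandas as pd

# Toy stand-in for autos['brand']; the counts are illustrative only.
brand_col = pd.Series(['volkswagen'] * 6 + ['bmw'] * 3 + ['lada'] * 1, name='brand')

# Fraction of listings per brand, sorted from most to least common.
shares = brand_col.value_counts(normalize=True)

min_share = 0.2  # hypothetical cut-off
common_brands = shares[shares > min_share].index.tolist()
print(common_brands)
```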

In [54]:
# First, I will practice with a not very elegant for loop.
# Afterwards, I will repeat the calculation with a more efficient approach.
brands = autos['brand'].unique()

mean_prices_brand = {}

# .iloc only works with integer positions, so we look up the integer position of the 'price' column once.
price_col = autos.columns.get_loc('price')

for brand in brands:
    price = 0
    count = 0
    for index, element in enumerate(autos['brand']):
        if brand == element:
            price += autos.iloc[index, price_col]
            count += 1
    mean_prices_brand[brand] = price / count

print(mean_prices_brand)

{'peugeot': 3086.930281690141, 'bmw': 8307.007435653002, 'volkswagen': 5364.202122480771, 'smart': 3538.344927536232, 'ford': 3756.9919547079858, 'chrysler': 3539.9166666666665, 'seat': 4353.146929824561, 'renault': 2448.8635181975737, 'mercedes_benz': 8570.76869865975, 'audi': 9259.510248372317, 'sonstige_autos': 24016.248898678416, 'opel': 2968.859330143541, 'mazda': 4075.319293478261, 'porsche': 46764.2, 'mini': 10566.824940047962, 'toyota': 5148.0032733224225, 'dacia': 5897.736434108527, 'nissan': 4681.94046008119, 'jeep': 11590.214953271028, 'saab': 3183.493670886076, 'volvo': 4911.680459770115, 'mitsubishi': 3429.8673469387754, 'jaguar': 11844.041666666666, 'fiat': 2806.984076433121, 'skoda': 6394.309677419355, 'subaru': 4019.07, 'kia': 5923.288629737609, 'citroen': 3772.460410557185, 'chevrolet': 6692.60294117647, 'hyundai': 5405.15625, 'honda': 4010.4728682170544, 'daewoo': 1093.6, 'suzuki': 4166.767605633803, 'trabant': 1843.5384615384614, 'land_rover': 18934.272727272728, 'al

*Now I will calculate the mean prices by using fewer lines of code:*

In [55]:
brands = autos['brand'].unique()

mean_prices_brand = {}

for brand in brands:
    mean_prices_brand[brand] = autos[autos['brand'] == brand]['price'].mean()
    
            
print(mean_prices_brand)          

{'peugeot': 3086.930281690141, 'bmw': 8307.007435653002, 'volkswagen': 5364.202122480771, 'smart': 3538.344927536232, 'ford': 3756.9919547079858, 'chrysler': 3539.9166666666665, 'seat': 4353.146929824561, 'renault': 2448.8635181975737, 'mercedes_benz': 8570.76869865975, 'audi': 9259.510248372317, 'sonstige_autos': 24016.248898678416, 'opel': 2968.859330143541, 'mazda': 4075.319293478261, 'porsche': 46764.2, 'mini': 10566.824940047962, 'toyota': 5148.0032733224225, 'dacia': 5897.736434108527, 'nissan': 4681.94046008119, 'jeep': 11590.214953271028, 'saab': 3183.493670886076, 'volvo': 4911.680459770115, 'mitsubishi': 3429.8673469387754, 'jaguar': 11844.041666666666, 'fiat': 2806.984076433121, 'skoda': 6394.309677419355, 'subaru': 4019.07, 'kia': 5923.288629737609, 'citroen': 3772.460410557185, 'chevrolet': 6692.60294117647, 'hyundai': 5405.15625, 'honda': 4010.4728682170544, 'daewoo': 1093.6, 'suzuki': 4166.767605633803, 'trabant': 1843.5384615384614, 'land_rover': 18934.272727272728, 'al

*Indeed, both results are the same.*
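The loop over brands can also be replaced by a single `groupby` call. The following is a minimal sketch on a toy DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Toy stand-in for the autos dataframe.
toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel', 'opel'],
    'price': [8000, 9000, 2000, 3000, 4000],
})

# Group the rows by brand and average the price within each group.
mean_prices = toy.groupby('brand')['price'].mean()

# Convert to a plain dictionary, matching the shape of mean_prices_brand.
mean_prices_dict = mean_prices.to_dict()
print(mean_prices_dict)
```

The result of `groupby(...).mean()` is already a Series indexed by brand, so the conversion to a dictionary is only needed if a dictionary is wanted downstream.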

**Now that we have the mean price values, we are going to sort them:**

NOTE: 

Python's built-in sorted() function sorts any iterable, such as lists, tuples, and dictionaries. However, when given a dictionary, it iterates over the keys only, so by default it returns a sorted list of keys.

To sort a dictionary by its values, we combine sorted() with items() and a key function. The result is a list of (key, value) tuples, not a dictionary. The steps are as follows:

a. Call the items() method on the dictionary to retrieve its keys and values as a sequence of (key, value) tuples.

b. Pass that sequence to sorted() as the first argument.

c. Specify a key function, here a lambda, that extracts the value from each tuple. The key function defines the sorting criterion.
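The steps above can be sketched on a small dictionary. The `prices` dictionary below is a toy example with rounded, illustrative values:

```python
# Toy dictionary; values are rounded for illustration.
prices = {'bmw': 8307.0, 'daewoo': 1093.6, 'audi': 9259.5}

# Default behaviour: iterating a dict yields its keys, so we get sorted keys.
by_key = sorted(prices)

# Sort the (key, value) pairs by the value, i.e. the second tuple element.
by_value = sorted(prices.items(), key=lambda pair: pair[1])

# reverse=True gives descending order, most expensive first.
by_value_desc = sorted(prices.items(), key=lambda pair: pair[1], reverse=True)

print(by_key)
print(by_value)
print(by_value_desc)
```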

In [56]:
sorted_mean_prices_brand = sorted(mean_prices_brand.items(), key = lambda x:x[1])
# The key parameter is set to lambda x: x[1], which specifies that the sorting should be based on the second element (value) of each pair.

In [57]:
for element in sorted_mean_prices_brand:
    print(element[0],':',element[1] )

daewoo : 1093.6
rover : 1586.4923076923078
daihatsu : 1641.2644628099174
trabant : 1843.5384615384614
renault : 2448.8635181975737
lada : 2647.7241379310344
fiat : 2806.984076433121
opel : 2968.859330143541
peugeot : 3086.930281690141
saab : 3183.493670886076
lancia : 3240.703703703704
mitsubishi : 3429.8673469387754
smart : 3538.344927536232
chrysler : 3539.9166666666665
ford : 3756.9919547079858
citroen : 3772.460410557185
honda : 4010.4728682170544
subaru : 4019.07
alfa_romeo : 4054.471875
mazda : 4075.319293478261
suzuki : 4166.767605633803
seat : 4353.146929824561
nissan : 4681.94046008119
volvo : 4911.680459770115
toyota : 5148.0032733224225
volkswagen : 5364.202122480771
hyundai : 5405.15625
dacia : 5897.736434108527
kia : 5923.288629737609
skoda : 6394.309677419355
chevrolet : 6692.60294117647
bmw : 8307.007435653002
mercedes_benz : 8570.76869865975
audi : 9259.510248372317
mini : 10566.824940047962
jeep : 11590.214953271028
jaguar : 11844.041666666666
land_rover : 18934.272727

As expected, the most expensive 'brands' are 'porsche' (mean price of about 46,764) and 'sonstige_autos' (about 24,016). 'sonstige_autos' means 'other cars' in English and contains very expensive brands such as Ferrari or Maserati. The cheapest brands are 'daewoo' and 'rover'.



**For the 8 most expensive brands we will try to find a correlation between mileage and price:**

NOTE: The high mean prices for Porsche and for the listings grouped under 'sonstige_autos' (other brands or specific models) make sense without further investigation, since both contain expensive cars. Nevertheless, for the 8 brands with the highest mean prices, these two included, we will check whether price correlates with mileage. Mileage can affect the price of a used car, because lower mileage is often associated with better condition and therefore higher value.

*In order to facilitate the analysis, first of all, we will convert the mean_prices_brand dictionary into a series object, and then into a dataframe object of one column. Afterwards, we will create another series object from a dictionary with the mean mileage for each brand and append that column to the previous dataframe.*
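As a compact illustration of that plan, the two dictionaries can be combined into one DataFrame in a single step. This is a sketch with toy dictionaries (values rounded, names hypothetical); pandas aligns the two Series by their index labels, so the key order in each dictionary does not matter:

```python
import pandas as pd

# Toy dictionaries; note the different key order.
mean_price = {'bmw': 8307.0, 'daewoo': 1093.6}
mean_mileage = {'daewoo': 121266.7, 'bmw': 132803.6}

# Each dictionary becomes a Series; the DataFrame aligns them on the brand labels.
summary = pd.DataFrame({
    'mean_price': pd.Series(mean_price),
    'mean_mileage': pd.Series(mean_mileage),
})
print(summary)
```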

In [62]:
mpb_series = pd.Series(mean_prices_brand) # we use the pandas Series constructor.
print(mpb_series)
# The keys of the dictionary become the indexes of the series.

peugeot            3086.930282
bmw                8307.007436
volkswagen         5364.202122
smart              3538.344928
ford               3756.991955
chrysler           3539.916667
seat               4353.146930
renault            2448.863518
mercedes_benz      8570.768699
audi               9259.510248
sonstige_autos    24016.248899
opel               2968.859330
mazda              4075.319293
porsche           46764.200000
mini              10566.824940
toyota             5148.003273
dacia              5897.736434
nissan             4681.940460
jeep              11590.214953
saab               3183.493671
volvo              4911.680460
mitsubishi         3429.867347
jaguar            11844.041667
fiat               2806.984076
skoda              6394.309677
subaru             4019.070000
kia                5923.288630
citroen            3772.460411
chevrolet          6692.602941
hyundai            5405.156250
honda              4010.472868
daewoo             1093.600000
suzuki  

In [64]:
mpb_dataframe = pd.DataFrame(mpb_series, columns=['mean_price'])


In [66]:
mpb_dataframe.sort_values('mean_price')

Unnamed: 0,mean_price
daewoo,1093.6
rover,1586.492308
daihatsu,1641.264463
trabant,1843.538462
renault,2448.863518
lada,2647.724138
fiat,2806.984076
opel,2968.85933
peugeot,3086.930282
saab,3183.493671


In [68]:
# we create the dictionary with the brand and their mean mileage. 
brands = autos['brand'].unique()

mean_mileage_brand = {}

for brand in brands:
    mean_mileage_brand[brand] = autos[autos['brand'] == brand]['odometer_km'].mean()
    
            
print(mean_mileage_brand)          

{'peugeot': 127316.9014084507, 'bmw': 132803.62249761677, 'volkswagen': 129059.48787849284, 'smart': 100833.33333333333, 'ford': 124360.8462455304, 'chrysler': 133125.0, 'seat': 122149.12280701754, 'renault': 128279.89601386481, 'mercedes_benz': 131079.76653696498, 'audi': 129604.53339763684, 'sonstige_autos': 90892.07048458149, 'opel': 129527.27272727272, 'mazda': 124959.23913043478, 'porsche': 98375.0, 'mini': 89100.71942446043, 'toyota': 116219.31260229132, 'dacia': 84728.68217054264, 'nissan': 118707.71312584574, 'jeep': 127102.80373831776, 'saab': 143670.88607594935, 'volvo': 138839.0804597701, 'mitsubishi': 127053.57142857143, 'jaguar': 125763.88888888889, 'fiat': 117472.13375796178, 'skoda': 111051.6129032258, 'subaru': 126100.0, 'kia': 112521.86588921283, 'citroen': 120029.32551319648, 'chevrolet': 100514.70588235294, 'hyundai': 107239.58333333333, 'honda': 123397.93281653746, 'daewoo': 121266.66666666667, 'suzuki': 108485.91549295775, 'trabant': 56692.307692307695, 'land_rover

In [71]:
# We convert the dictionary into a series:
mmb_series = pd.Series(mean_mileage_brand)
print(mmb_series)

peugeot           127316.901408
bmw               132803.622498
volkswagen        129059.487878
smart             100833.333333
ford              124360.846246
chrysler          133125.000000
seat              122149.122807
renault           128279.896014
mercedes_benz     131079.766537
audi              129604.533398
sonstige_autos     90892.070485
opel              129527.272727
mazda             124959.239130
porsche            98375.000000
mini               89100.719424
toyota            116219.312602
dacia              84728.682171
nissan            118707.713126
jeep              127102.803738
saab              143670.886076
volvo             138839.080460
mitsubishi        127053.571429
jaguar            125763.888889
fiat              117472.133758
skoda             111051.612903
subaru            126100.000000
kia               112521.865889
citroen           120029.325513
chevrolet         100514.705882
hyundai           107239.583333
honda             123397.932817
daewoo  

In [75]:
# And then we assign this Series object as a new column in the previous DataFrame:
mpb_dataframe['mean_mileage'] = mmb_series


In [82]:
mpb_dataframe.sort_values('mean_price')

Unnamed: 0,mean_price,mean_mileage
daewoo,1093.6,121266.666667
rover,1586.492308,138230.769231
daihatsu,1641.264463,115619.834711
trabant,1843.538462,56692.307692
renault,2448.863518,128279.896014
lada,2647.724138,85000.0
fiat,2806.984076,117472.133758
opel,2968.85933,129527.272727
peugeot,3086.930282,127316.901408
saab,3183.493671,143670.886076


Based on this table alone, we don't see a clear correlation between price and mileage; a scatter plot would be needed to look for trends. Six brands have a mean mileage below 100,000 km, and three of them are among the 10 most expensive brands, which is not enough to draw a conclusion. For the 8 most expensive brands, the table alone does not support any conclusion either: two of them have a mean mileage below 100,000 km, but those brands are typically expensive anyway.
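Besides a scatter plot, a correlation coefficient gives a quick numeric check. The sketch below rebuilds only the ten rows displayed in the table above (values rounded to two decimals) and computes the Pearson correlation with `Series.corr`. Because these are just the ten cheapest brands, the number it prints says nothing definitive about the full set of brands; on the real data one would presumably call the same method on `mpb_dataframe`.

```python
import pandas as pd

# The ten rows shown in the table above, rounded to two decimals.
rows = pd.DataFrame(
    {
        'mean_price': [1093.60, 1586.49, 1641.26, 1843.54, 2448.86,
                       2647.72, 2806.98, 2968.86, 3086.93, 3183.49],
        'mean_mileage': [121266.67, 138230.77, 115619.83, 56692.31, 128279.90,
                         85000.00, 117472.13, 129527.27, 127316.90, 143670.89],
    },
    index=['daewoo', 'rover', 'daihatsu', 'trabant', 'renault',
           'lada', 'fiat', 'opel', 'peugeot', 'saab'],
)

# Pearson correlation (the default method) between the two columns.
r = rows['mean_price'].corr(rows['mean_mileage'])
print(round(r, 3))
```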