# Exploring eBay Car Sales Data

This project involves exploring and analyzing a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset, originally scraped and modified to resemble real-world, uncleaned data, includes information on car listings such as price, model, registration year, odometer readings, and more.

The goal of this project is to clean the dataset and perform initial analyses, using pandas for data manipulation. This project also highlights some of the advantages of working in a Jupyter environment.

In [10]:
import pandas as pd
import numpy as np

autos = pd.read_csv("Dataset/autos.csv")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 732: invalid continuation byte

Reading the file with the default UTF-8 encoding raises a `UnicodeDecodeError`, so we retry with the Latin-1 encoding.

In [None]:
autos = pd.read_csv("Dataset/autos.csv", encoding="latin1")
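
When the encoding of a file is unknown, one option is to try a short list of candidates in order. The sketch below is only an illustration: the helper `read_csv_with_fallback` and the temporary demo file are assumptions, not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return (DataFrame, encoding used)."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next encoding
    raise last_error


# Demo on a tiny temporary file that contains the byte 0xdc ("Ü" in Latin-1).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write("name,price\nGOLF_3TÜRER,1500\n".encode("latin1"))
    demo_path = handle.name

demo, used_encoding = read_csv_with_fallback(demo_path)
os.remove(demo_path)
print(used_encoding)
```

Latin-1 maps every byte to a character, so it never raises a decode error. It is therefore best kept as the last candidate in the list.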

In [None]:
autos.info()


In [None]:
autos.head()

### Dataset Overview

The dataset contains **371,528 rows** and **20 columns**, which describe various attributes of used car listings. Here are some key observations:

- **Data Completeness**: 
  - Most columns are fully populated, but some contain missing values, notably:
    - `vehicleType` (333,659 non-null values)
    - `gearbox` (351,319 non-null values)
    - `model` (351,044 non-null values)
    - `fuelType` (338,142 non-null values)
    - `notRepairedDamage` (299,468 non-null values)
  
- **Data Types**:
  - The dataset includes a mix of numerical (`int64`) and categorical (`object`) data.
  - Key numerical fields include `price`, `powerPS`, `yearOfRegistration`, `kilometer`, and `postalCode`.
  - Categorical data includes fields like `name`, `seller`, `offerType`, `vehicleType`, and `fuelType`.
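
The completeness figures above can also be read off programmatically instead of from `autos.info()`. A minimal sketch, using a hypothetical three-row sample rather than the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical sample, not taken from the real dataset.
sample = pd.DataFrame({
    "price": [480, 18300, 9800],
    "vehicleType": [np.nan, "coupe", "suv"],
    "notRepairedDamage": [np.nan, "ja", np.nan],
})

# Count missing values per column and express them as a share of all rows.
summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean(),
})

# Keep only the columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```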




### Cleaning Column Names

**The column names use camelCase instead of Python's preferred snake_case.**

In [11]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [12]:
new_columns_name = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'kilometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

In [13]:
autos.columns = new_columns_name

In [14]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


**Changing column names to snake_case improves readability and consistency, making data processing and analysis easier with tools like pandas. This convention helps avoid case-related errors and makes the code more maintainable.**
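
For columns that only need a mechanical camelCase to snake_case conversion, a small helper can replace the hand-written list. This is a sketch under assumptions: `camel_to_snake` is a hypothetical helper, and it would not reproduce the manual renames used above (for example `yearOfRegistration` to `registration_year` or `dateCreated` to `ad_created`).

```python
import re


def camel_to_snake(name):
    """Insert "_" wherever an uppercase letter follows a lowercase letter or digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(name) for name in original]
print(converted)
```

The result could then be assigned with `autos.columns = [camel_to_snake(c) for c in autos.columns]`, followed by a `rename` for the few columns that get a different name.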

###  Initial Exploration and Cleaning

For price:

In [None]:
autos.head()

In [None]:
autos.info()

In [None]:
autos["price"].shape

In [None]:
autos["price"].describe()

In [15]:
autos = autos.loc[autos["price"] != 0]

**Entries with a price of 0 were removed as they often represent data entry errors or missing values, which could skew the analysis and statistics by introducing non-representative values of the used car market.**


In [16]:
autos['price'].value_counts().sort_index(ascending=True)

price
1             1189
2               12
3                8
4                1
5               26
              ... 
32545461         1
74185296         1
99000000         1
99999999        15
2147483647       1
Name: count, Length: 5596, dtype: int64

In [17]:
autos = autos[autos['price'] >= 1000]

In [18]:
autos['price'].value_counts().sort_index(ascending=True)

price
1000          4649
1001            11
1003             1
1009             1
1010             2
              ... 
32545461         1
74185296         1
99000000         1
99999999        15
2147483647       1
Name: count, Length: 5095, dtype: int64

In [19]:
autos = autos[autos['price'] <= 27000000]

In [20]:
autos['price'].value_counts().sort_index(ascending=True)

price
1000        4649
1001          11
1003           1
1009           1
1010           2
            ... 
10000000       8
10010011       1
11111111      10
12345678       9
14000500       1
Name: count, Length: 5089, dtype: int64

I decided to remove all cars priced below 1,000 and all cars priced above 27 million, since 27 million is taken here as the price of the most expensive car in the world.
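
The two price filters can also be written as a single step with `Series.between`, which is inclusive on both ends by default. A minimal sketch on a hypothetical sample; the constant names `MIN_PRICE` and `MAX_PRICE` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical prices covering both sides of each threshold.
sample = pd.DataFrame({"price": [0, 1, 999, 1000, 18300, 27_000_000, 99_999_999]})

MIN_PRICE = 1_000
MAX_PRICE = 27_000_000

# Keep rows whose price lies in [MIN_PRICE, MAX_PRICE].
filtered = sample[sample["price"].between(MIN_PRICE, MAX_PRICE)]
print(filtered["price"].tolist())
```

Naming the thresholds keeps the cut-offs in one place if they need to be revisited later.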


For kilometer:

In [21]:
autos['kilometer'].value_counts().sort_index(ascending=True)

kilometer
5000        3634
10000       1723
20000       5027
30000       5640
40000       6184
50000       7343
60000       8257
70000       9193
80000      10195
90000      11294
100000     13693
125000     31976
150000    174014
Name: count, dtype: int64

In [22]:
autos["price"].describe

<bound method NDFrame.describe of 1         18300
2          9800
3          1500
4          3600
6          2200
          ...  
371523     2200
371524     1199
371525     9200
371526     3400
371527    28990
Name: price, Length: 288173, dtype: int64>

For the price, the data seems accurate.
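
Note that `autos["price"].describe` without parentheses returns the method object itself, which is why the output above is a `<bound method ...>` repr rather than summary statistics. A short sketch of the difference, on hypothetical prices:

```python
import pandas as pd

prices = pd.Series([1500, 3600, 9800, 18300], name="price")  # hypothetical values

method = prices.describe    # no parentheses: the bound method, nothing is computed
stats = prices.describe()   # with parentheses: a Series of summary statistics

print(callable(method))
print(stats[["count", "min", "max"]])
```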

### Exploring the Date Columns

Three columns (`date_crawled`, `last_seen`, `ad_created`) are stored as strings and need to be converted to a datetime format for analysis. The other two (`registration_month`, `registration_year`) are already numeric and can be analyzed directly.

In [23]:
autos["date_crawled"].head()


1    2016-03-24 10:58:45
2    2016-03-14 12:52:21
3    2016-03-17 16:54:04
4    2016-03-31 17:25:20
6    2016-04-01 20:48:51
Name: date_crawled, dtype: object

In [24]:
# Convert columns to datetime
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'])
autos['ad_created'] = pd.to_datetime(autos['ad_created'])
autos['last_seen'] = pd.to_datetime(autos['last_seen'])

In [25]:
# Extract only the date part (remove the time)
autos['date_crawled'] = autos['date_crawled'].dt.date
autos['ad_created'] = autos['ad_created'].dt.date
autos['last_seen'] = autos['last_seen'].dt.date
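
One thing to be aware of: `.dt.date` returns plain Python `date` objects, so the column falls back to the `object` dtype and loses the `.dt` accessor. If later steps need datetime operations, `.dt.normalize()` is an alternative that keeps the datetime dtype and only resets the time to midnight. A sketch on two timestamps from the output above:

```python
import datetime

import pandas as pd

stamps = pd.to_datetime(pd.Series(["2016-03-24 10:58:45", "2016-03-14 12:52:21"]))

as_dates = stamps.dt.date            # Python date objects, object dtype
as_midnight = stamps.dt.normalize()  # still datetime64, time set to 00:00:00

print(type(as_dates.iloc[0]).__name__)
print(as_midnight.dtype)
print(as_midnight.dt.month.tolist())  # the .dt accessor is still available
```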


In [26]:
# Calculate the distribution as percentages and sort by date for each column
date_crawled_dist = autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index()
ad_created_dist = autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index()
last_seen_dist = autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index()


In [27]:
# Print the results
print("Distribution of date_crawled:")
print(date_crawled_dist)

print("\nDistribution of ad_created:")
print(ad_created_dist)

print("\nDistribution of last_seen:")
print(last_seen_dist)

Distribution of date_crawled:
date_crawled
2016-03-05    0.025835
2016-03-06    0.014679
2016-03-07    0.035208
2016-03-08    0.032973
2016-03-09    0.033469
2016-03-10    0.032539
2016-03-11    0.032762
2016-03-12    0.036617
2016-03-13    0.016077
2016-03-14    0.036374
2016-03-15    0.032505
2016-03-16    0.029694
2016-03-17    0.030995
2016-03-18    0.012944
2016-03-19    0.035489
2016-03-20    0.036641
2016-03-21    0.035132
2016-03-22    0.032036
2016-03-23    0.031793
2016-03-24    0.029847
2016-03-25    0.032939
2016-03-26    0.032748
2016-03-27    0.030856
2016-03-28    0.035291
2016-03-29    0.033858
2016-03-30    0.033164
2016-03-31    0.031516
2016-04-01    0.034705
2016-04-02    0.035982
2016-04-03    0.039761
2016-04-04    0.037970
2016-04-05    0.012763
2016-04-06    0.003210
2016-04-07    0.001627
Name: proportion, dtype: float64

Distribution of ad_created:
ad_created
2014-03-10    0.000003
2015-03-20    0.000003
2015-06-11    0.000003
2015-06-18    0.000003
2015-08-07

In [28]:
autos.head(0)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen


In [29]:
autos["registration_year"].describe()

count    288173.000000
mean       2005.147699
std          65.534244
min        1000.000000
25%        2001.000000
50%        2005.000000
75%        2009.000000
max        9999.000000
Name: registration_year, dtype: float64

From our previous exploration, we observed anomalies in the registration_year column: the minimum value is 1000, predating the invention of cars, and the maximum value is 9999, extending far into the future. Given that cars cannot be registered after the listing date, any registration year beyond 2016 is definitely inaccurate. Determining the earliest valid year is more complex but likely falls within the early 1900s. We will count the number of listings with registration years outside the 1900-2016 range to decide whether to remove these rows or apply more nuanced logic.
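
Counting the listings outside the 1900-2016 range could look like the sketch below. The sample years are hypothetical; on the real data the same expressions would be applied to `autos['registration_year']`.

```python
import pandas as pd

# Hypothetical registration years, including a few implausible ones.
years = pd.Series([1000, 1910, 1993, 2001, 2011, 2016, 2017, 9999],
                  name="registration_year")

valid = years.between(1900, 2016)          # inclusive on both ends
outside_count = int((~valid).sum())        # number of out-of-range rows
outside_share = float((~valid).mean())     # share of out-of-range rows

print(outside_count, outside_share)
```

If the share is small, dropping the rows is usually the simplest choice.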

In [31]:
autos['registration_year'].value_counts().sort_index(ascending=True)

registration_year
1000    10
1001     1
1039     1
1111     1
1234     1
        ..
7800     1
8500     1
8888     2
9000     3
9999     7
Name: count, Length: 128, dtype: int64

In [32]:
autos = autos[autos['registration_year'] >= 1900]
autos = autos[autos['registration_year'] <=  2016]

In [33]:
autos['registration_year'].value_counts().sort_index(ascending=True)

registration_year
1910      14
1911       1
1923       3
1925       1
1927       2
        ... 
2012    9319
2013    6067
2014    4691
2015    2704
2016    4868
Name: count, Length: 94, dtype: int64

In [35]:
autos[autos['registration_year'] < 1950]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
2018,2016-03-25,Volkswagen_Andere_typ82,privat,Angebot,7000,test,suv,1945,manuell,48,andere,150000,2,benzin,volkswagen,nein,2016-03-25,0,58135,2016-03-25
2376,2016-03-15,Andere_Andere,privat,Angebot,1800,control,cabrio,1925,,0,,5000,1,,sonstige_autos,nein,2016-03-15,0,79288,2016-04-07
10549,2016-03-17,Ford_V8_Cabrio_im_restaurierten_Topzustand,privat,Angebot,35000,control,cabrio,1937,manuell,90,andere,5000,6,benzin,ford,nein,2016-03-17,0,50859,2016-04-07
10691,2016-03-27,Tausche/Verkaufe_S51_gegen_Audi_ab_150ps_oder_...,privat,Angebot,1250,test,,1910,,0,andere,5000,0,,audi,,2016-03-27,0,18445,2016-04-07
12468,2016-03-22,Opel_Olympia_Cabrio_1936_Top_Zustand,privat,Angebot,21900,control,cabrio,1936,manuell,23,andere,5000,7,benzin,opel,nein,2016-03-22,0,78467,2016-04-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352061,2016-03-15,Oldtimer_Aero_30_Roadster_1938,privat,Angebot,17000,test,cabrio,1944,manuell,30,,5000,1,benzin,sonstige_autos,nein,2016-03-15,0,45897,2016-03-26
352421,2016-03-11,BMW_DIXI_OLDTIMER,privat,Angebot,20000,control,coupe,1929,manuell,12,andere,10000,4,benzin,bmw,,2016-03-11,0,74523,2016-04-07
354533,2016-03-05,Ford_Business_Coupe_Hotrod_Projekt.1937,privat,Angebot,7000,test,coupe,1937,manuell,85,andere,5000,8,benzin,ford,ja,2016-03-05,0,8359,2016-04-07
362834,2016-03-26,Oldtimer_Unikat_Volkswagen_Anfibio_Schwimmwage...,privat,Angebot,18900,control,suv,1943,manuell,60,andere,150000,3,benzin,volkswagen,nein,2016-03-26,0,51065,2016-03-26


In [36]:
autos = autos[autos['registration_year'] >= 1950]

In [37]:
autos.value_counts(normalize=True)

date_crawled  name                                                     seller  offer_type  price  abtest   vehicle_type  registration_year  gearbox    power_ps  model     kilometer  registration_month  fuel_type  brand       unrepaired_damage  ad_created  nr_of_pictures  postal_code  last_seen 
2016-03-22    XC_90_D5_SPORT_DYNAUDIO_7_SITZE_1.HAND_S_HEFT            privat  Angebot     10000  control  suv           2008               automatik  185       xc_reihe  90000      1                   diesel     volvo       nein               2016-03-22  0               10115        2016-03-22    0.000014
              Touareg_3.0_V6_TDI_DPF_Aut._Bi_Xen_1_Hand_Kamera         privat  Angebot     10100  control  suv           2008               automatik  239       touareg   100000     12                  diesel     volkswagen  nein               2016-03-22  0               10115        2016-03-22    0.000014
2016-03-11    A3_2.0_TDI_Sportback_S_tronic_2X_S_LINE_PANORAMA         privat  Angebot  

In [38]:
autos['registration_year'].value_counts(normalize=True, dropna=False).sort_index()

registration_year
1950    0.000058
1951    0.000061
1952    0.000040
1953    0.000061
1954    0.000050
          ...   
2012    0.033611
2013    0.021882
2014    0.016919
2015    0.009753
2016    0.017558
Name: proportion, Length: 67, dtype: float64

The distribution of `registration_year` shows that cars from the early 1950s are rare: each of those years accounts for less than 0.01% of listings. More recent years are far better represented, with 2012 alone accounting for about 3.4% of listings. Among the most recent years, the share falls from 2012 through 2013 (2.2%), 2014 (1.7%), and 2015 (1.0%), then rises again to 1.8% in 2016. The decline might mean that very recent cars are listed less often, or it could reflect a data collection bias.

### Exploring Price by Brand

In [42]:
autos["brand"].unique()

array(['audi', 'jeep', 'volkswagen', 'skoda', 'peugeot', 'ford', 'mazda',
       'nissan', 'renault', 'mercedes_benz', 'bmw', 'honda', 'opel',
       'mini', 'smart', 'hyundai', 'subaru', 'volvo', 'mitsubishi',
       'alfa_romeo', 'kia', 'seat', 'suzuki', 'lancia', 'porsche',
       'citroen', 'fiat', 'toyota', 'chevrolet', 'sonstige_autos',
       'dacia', 'daihatsu', 'chrysler', 'jaguar', 'rover', 'saab',
       'daewoo', 'land_rover', 'trabant', 'lada'], dtype=object)

In [64]:
brand_analyse = autos["brand"].value_counts()

In [75]:
brand_analyse

brand
volkswagen        58804
bmw               34651
mercedes_benz     30875
audi              27888
opel              24261
ford              16019
renault           10291
peugeot            7990
fiat               5808
skoda              5097
seat               4938
smart              4571
toyota             4104
mazda              4031
citroen            3782
nissan             3342
mini               3224
hyundai            2994
sonstige_autos     2759
volvo              2426
porsche            2120
kia                2081
honda              1986
mitsubishi         1839
suzuki             1714
alfa_romeo         1694
chevrolet          1629
chrysler           1024
dacia               847
land_rover          746
jeep                738
jaguar              581
subaru              484
saab                391
daihatsu            385
trabant             267
lancia              254
daewoo              243
rover               209
lada                173
Name: count, dtype: int64

I chose Volkswagen for the first analysis because it has the most listings (58,804), so its mean price rests on the largest sample.

In [54]:
brands = {}

In [49]:
volkswagen_brands = autos[autos["brand"] == "volkswagen"]

In [50]:
volkswagen_brands

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
3,2016-03-17,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17,0,91074,2016-03-17
11,2016-04-07,Volkswagen_Passat_Variant_2.0_TDI_Comfortline,privat,Angebot,2799,control,kombi,2005,manuell,140,passat,150000,12,diesel,volkswagen,ja,2016-04-07,0,57290,2016-04-07
13,2016-03-21,VW_PASSAT_1.9_TDI_131_PS_LEDER,privat,Angebot,2500,control,kombi,2004,manuell,131,passat,150000,2,,volkswagen,nein,2016-03-21,0,90762,2016-03-23
20,2016-04-01,Volkswagen_Scirocco_1.4_TSI_Sport,privat,Angebot,10400,control,coupe,2009,manuell,160,scirocco,100000,4,benzin,volkswagen,nein,2016-04-01,0,75365,2016-04-05
28,2016-03-09,Volkswagen_T3_andere,privat,Angebot,1990,test,bus,1981,manuell,50,transporter,5000,1,benzin,volkswagen,nein,2016-03-09,0,87471,2016-03-10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371509,2016-03-11,LT_35_DIESEL_Gruene_Plakette....letzte_Gelegen...,privat,Angebot,1900,test,,2000,manuell,110,,150000,7,,volkswagen,nein,2016-03-11,0,87700,2016-03-12
371516,2016-04-04,Volkswagen_Lupo_1.0,privat,Angebot,1490,control,kleinwagen,1998,manuell,50,lupo,150000,9,benzin,volkswagen,nein,2016-04-04,0,48653,2016-04-06
371517,2016-03-28,Volkswagen_Golf_2.0_TDI_DPF_Team,privat,Angebot,7900,test,limousine,2010,manuell,140,golf,150000,7,diesel,volkswagen,nein,2016-03-28,0,75223,2016-04-02
371525,2016-03-19,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200,test,bus,1996,manuell,102,transporter,150000,3,diesel,volkswagen,nein,2016-03-19,0,87439,2016-04-07


In [60]:
volkswagen_mean_price = volkswagen_brands["price"].mean()
brands = {"volkswagen" : volkswagen_mean_price }

In [61]:
print(brands)

{'volkswagen': 8761.34550370723}


We can see that the mean price for Volkswagen cars is about 8,761.35.

I create a function to automate this workflow.

In [78]:
brands_dict = {}

def update_brand_mean_price(dataframe, brand_name, brands_dict):
    # Filter the DataFrame by the brand name
    brand_data = dataframe[dataframe['brand'] == brand_name]
    
    # Calculate the mean price
    mean_price = brand_data['price'].mean()
    
    # Update the dictionary
    brands_dict[brand_name] = mean_price

In [79]:
for brand_name in brand_analyse.index:
    update_brand_mean_price(autos, brand_name, brands_dict)
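
The same per-brand mean can also be obtained without an explicit loop by using `groupby`. This is an alternative sketch on a hypothetical sample, not the approach used in this notebook; the name `brands_dict_sketch` is an assumption.

```python
import pandas as pd

# Hypothetical listings for three brands.
sample = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "audi", "audi", "skoda"],
    "price": [1500, 2500, 18300, 8300, 3600],
})

# Mean price per brand, most expensive first.
mean_price_by_brand = (
    sample.groupby("brand")["price"].mean().sort_values(ascending=False)
)

# Convert to a plain dict if the dictionary form is still needed.
brands_dict_sketch = mean_price_by_brand.to_dict()
print(mean_price_by_brand)
```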

In [80]:
brands_dict

{'volkswagen': 8761.34550370723,
 'bmw': 9942.717843640876,
 'mercedes_benz': 9742.940340080972,
 'audi': 11333.108039300057,
 'opel': 4814.637896212028,
 'ford': 6806.517135901117,
 'renault': 3559.205908075017,
 'peugeot': 4088.5470588235294,
 'fiat': 6255.520661157025,
 'skoda': 6927.274475181479,
 'seat': 5767.324827865533,
 'smart': 3887.7481951432947,
 'toyota': 5769.293372319688,
 'mazda': 7685.189034978914,
 'citroen': 4608.209677419355,
 'nissan': 6402.149311789348,
 'mini': 10168.12034739454,
 'hyundai': 6331.589846359386,
 'sonstige_autos': 33420.569771656395,
 'volvo': 6658.387469084913,
 'porsche': 53087.16084905661,
 'kia': 6648.516098029793,
 'honda': 5069.84944612286,
 'mitsubishi': 4969.922240348015,
 'suzuki': 5621.380980163361,
 'alfa_romeo': 5369.078512396694,
 'chevrolet': 8098.6967464702275,
 'chrysler': 5171.435546875,
 'dacia': 6033.410861865407,
 'land_rover': 17180.533512064343,
 'jeep': 12797.822493224932,
 'jaguar': 15023.905335628227,
 'subaru': 6364.237603

In [81]:
sorted_brands_dict = dict(sorted(brands_dict.items(), key=lambda item: item[1], reverse=True))

In [82]:
sorted_brands_dict

{'porsche': 53087.16084905661,
 'trabant': 44682.50187265917,
 'sonstige_autos': 33420.569771656395,
 'land_rover': 17180.533512064343,
 'jaguar': 15023.905335628227,
 'jeep': 12797.822493224932,
 'audi': 11333.108039300057,
 'mini': 10168.12034739454,
 'bmw': 9942.717843640876,
 'mercedes_benz': 9742.940340080972,
 'volkswagen': 8761.34550370723,
 'chevrolet': 8098.6967464702275,
 'mazda': 7685.189034978914,
 'skoda': 6927.274475181479,
 'ford': 6806.517135901117,
 'volvo': 6658.387469084913,
 'kia': 6648.516098029793,
 'nissan': 6402.149311789348,
 'subaru': 6364.237603305785,
 'hyundai': 6331.589846359386,
 'fiat': 6255.520661157025,
 'dacia': 6033.410861865407,
 'toyota': 5769.293372319688,
 'seat': 5767.324827865533,
 'suzuki': 5621.380980163361,
 'lancia': 5402.216535433071,
 'alfa_romeo': 5369.078512396694,
 'chrysler': 5171.435546875,
 'honda': 5069.84944612286,
 'mitsubishi': 4969.922240348015,
 'saab': 4947.028132992327,
 'opel': 4814.637896212028,
 'citroen': 4608.2096774193

### Storing Aggregate Data in a DataFrame

We aim to find a link between brands, price, and mileage.

In [89]:
mileage_dict = {}

def update_brand_mean_mileage(dataframe, brand_name, mileage_dict):
  
    brand_data = dataframe[dataframe['brand'] == brand_name]

    mean_mileage = brand_data['kilometer'].mean()
    
    mileage_dict[brand_name] = mean_mileage

for brand_name in brand_analyse.index:
    update_brand_mean_mileage(autos, brand_name, mileage_dict)

In [90]:
sorted_mileage_dict = dict(sorted(mileage_dict.items(), key=lambda item: item[1], reverse=True))

In [91]:
sorted_mileage_dict

{'saab': 139808.18414322252,
 'volvo': 136065.53998351196,
 'bmw': 132183.77536001848,
 'chrysler': 131855.46875,
 'mercedes_benz': 129777.81376518219,
 'rover': 128803.82775119618,
 'audi': 128045.75444635686,
 'alfa_romeo': 125796.93034238489,
 'volkswagen': 125510.08434800354,
 'opel': 123066.85627138206,
 'jaguar': 122865.7487091222,
 'subaru': 121580.57851239669,
 'jeep': 121124.66124661246,
 'mazda': 120983.62689159017,
 'honda': 120944.1087613293,
 'renault': 120898.35778835876,
 'peugeot': 119752.19023779724,
 'mitsubishi': 119676.45459488852,
 'land_rover': 119296.24664879356,
 'ford': 119290.52999563019,
 'daewoo': 117530.86419753087,
 'citroen': 115708.61977789529,
 'toyota': 115063.35282651072,
 'lancia': 114901.5748031496,
 'seat': 114721.54718509519,
 'skoda': 112429.86070237395,
 'nissan': 111710.0538599641,
 'fiat': 107434.57300275483,
 'daihatsu': 106909.09090909091,
 'kia': 105055.26189332052,
 'suzuki': 101423.57059509918,
 'hyundai': 99604.20841683367,
 'porsche': 9

Convert both dictionaries to Series objects:

In [92]:
mean_price_series = pd.Series(brands_dict)
mileage_mean_series = pd.Series(mileage_dict)

In [93]:
df = pd.DataFrame(mean_price_series, columns=['mean_price'])

In [94]:
df

Unnamed: 0,mean_price
volkswagen,8761.345504
bmw,9942.717844
mercedes_benz,9742.94034
audi,11333.108039
opel,4814.637896
ford,6806.517136
renault,3559.205908
peugeot,4088.547059
fiat,6255.520661
skoda,6927.274475


In [95]:
df["mean_mileage"] = mileage_mean_series
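
As an alternative to building two dictionaries and joining them, both aggregates can be computed in one pass with named aggregation. A sketch on a hypothetical sample; `brand_summary` is an assumed name:

```python
import pandas as pd

# Hypothetical listings for two brands.
sample = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "audi", "audi"],
    "price": [1500, 2500, 18300, 8300],
    "kilometer": [150000, 100000, 125000, 75000],
})

# One output column per (source column, aggregation function) pair.
brand_summary = sample.groupby("brand").agg(
    mean_price=("price", "mean"),
    mean_mileage=("kilometer", "mean"),
)
print(brand_summary)
```

Because both columns come from the same `groupby`, they share one index and no alignment step is needed.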

In [96]:
df

Unnamed: 0,mean_price,mean_mileage
volkswagen,8761.345504,125510.084348
bmw,9942.717844,132183.77536
mercedes_benz,9742.94034,129777.813765
audi,11333.108039,128045.754446
opel,4814.637896,123066.856271
ford,6806.517136,119290.529996
renault,3559.205908,120898.357788
peugeot,4088.547059,119752.190238
fiat,6255.520661,107434.573003
skoda,6927.274475,112429.860702


The aggregated data reveals large differences in mean price and mileage across car brands. Porsche has the highest mean price at about €53,087. Trabant (€44,683) and `sonstige_autos` ("other cars", €33,421) come next, but these values probably reflect atypical data: listings priced in the millions survived the price filter and can inflate the mean of a small brand. Among the remaining brands, Land Rover (€17,181), Jaguar (€15,024), and Jeep (€12,798) are the most expensive. At the low end are budget brands such as Daewoo (€1,538), Renault (€3,559), and Smart (€3,888).

Mileage also varies, although less than price. Saab (139,808 km), Volvo (136,066 km), and BMW (132,184 km) have the highest mean mileage, while Porsche falls below 100,000 km. This might suggest that the most expensive cars are driven less, although the data alone cannot confirm it. The German brands Audi, BMW, and Mercedes-Benz combine relatively high prices with some of the highest mean mileages (roughly 128,000 to 132,000 km). Note that `kilometer` is recorded in bins capped at 150,000, which compresses the differences between brands.
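
To put a number on the link between mean price and mean mileage, one option is the Pearson correlation between the two columns. The sketch below uses only the ten rows visible in the `df` output above, so it is an illustration of the call rather than a result for the full dataset; `df_head` is an assumed name.

```python
import pandas as pd

# The ten brands shown in the `df` output; the remaining brands are omitted.
df_head = pd.DataFrame(
    {
        "mean_price": [8761.345504, 9942.717844, 9742.94034, 11333.108039,
                       4814.637896, 6806.517136, 3559.205908, 4088.547059,
                       6255.520661, 6927.274475],
        "mean_mileage": [125510.084348, 132183.77536, 129777.813765,
                         128045.754446, 123066.856271, 119290.529996,
                         120898.357788, 119752.190238, 107434.573003,
                         112429.860702],
    },
    index=["volkswagen", "bmw", "mercedes_benz", "audi", "opel",
           "ford", "renault", "peugeot", "fiat", "skoda"],
)

# Pearson correlation between the two per-brand means.
correlation = df_head["mean_price"].corr(df_head["mean_mileage"])
print(round(correlation, 2))
```

On the full table the same call would be `df["mean_price"].corr(df["mean_mileage"])`. Brands with outlier-driven means would need to be handled first.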