### Project Overview

This guided project focuses on a dataset of used cars sourced from eBay Kleinanzeigen, the classifieds section of the German eBay website. The project involves cleaning the dataset to address issues such as incorrect data, missing values, and inconsistencies. After the cleaning process, the dataset will be analyzed to uncover patterns and gain insights into the used car market.

In [732]:
import pandas as pd
import numpy as np

In [733]:
autos = pd.read_csv("autos.csv", encoding = "latin1")

In [734]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [735]:
print(autos.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [736]:
print(autos.head())

           dateCrawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offerType   price   abtest vehicleType  yearOfRegistration  \
0  privat   Angebot  $5,000  control         bus                2004   
1  privat   Angebot  $8,500  control   limousine                1997   
2  privat   Angebot  $8,990     test   limousine                2009   
3  privat   Angebot  $4,350  control  kleinwagen                2007   
4  privat   Angebot  $1,350     test       kombi                2003   

     gearbox  powerPS   model   odometer  monthOfRegistration fuelType  \
0    manuell      158  andere 

### Dataset Overview

The dataset contains information on 50,000 cars, with 20 columns describing various features of each vehicle. Most of the column values are in string format, and some columns contain null values. These issues will be addressed during the data cleaning process to ensure the dataset is prepared for analysis.  
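
The two issues named above (text-typed columns and missing values) can be quantified per column before any cleaning. The following is a minimal sketch on a made-up frame called `sample`; the column names mirror the dataset, but the values and the `summary` layout are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "vehicleType": ["bus", np.nan, "kombi"],
    "powerPS": [158, 286, 0],
})

# One row per column: its dtype, how many values are missing, and the missing share.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "null_count": sample.isna().sum(),
    "null_share": sample.isna().mean(),
})
print(summary)
```

Running the same three expressions on `autos` would give the per-column picture that `autos.info()` reports in text form.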

In [737]:
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


In [738]:
corrected_columns = {
    'dateCrawled': 'date_crawled', 
    'name': 'name', 
    'seller': 'seller', 
    'offerType': 'offer_type', 
    'price': 'price', 
    'abtest': 'abtest',
    'vehicleType': 'vehicle_type', 
    'yearOfRegistration': 'registration_year', 
    'gearbox': 'gearbox', 
    'powerPS': 'power_ps', 
    'model': 'model',
    'odometer': 'odometer', 
    'monthOfRegistration': 'registration_month', 
    'fuelType': 'fuel_type', 
    'brand': 'brand',
    'notRepairedDamage': 'unrepaired_damage', 
    'dateCreated': 'ad_created', 
    'nrOfPictures': 'nr_of_pictures', 
    'postalCode': 'postal_code',
    'lastSeen': 'last_seen'
}

In [739]:
autos.rename(columns=corrected_columns, inplace = True)

In [740]:
print(autos.columns)

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')


In [741]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Changes Made to the Dataset

   - Column names were updated to be more descriptive, improving clarity and making the dataset easier to understand. 
   
   - The column names were converted from `camelCase` to `snake_case`. This change aligns with the naming conventions commonly used in Python and pandas, enhancing code readability and consistency.
   

These changes improve the overall usability of the dataset by making the column names self-explanatory and the code easier to write and interpret.
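
The `camelCase` to `snake_case` part of the renaming can be done mechanically. The sketch below uses a hypothetical helper, `camel_to_snake`, which is not part of the original notebook; it only handles the purely mechanical cases, so reworded names such as `yearOfRegistration` to `registration_year` still need the manual mapping shown earlier.

```python
import re

def camel_to_snake(name: str) -> str:
    """Insert '_' before an uppercase letter that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(c) for c in original]
print(converted)
```

A mixed approach is possible: apply the helper to every column first, then pass a small dictionary to `rename` for the handful of names that were reworded.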


In [742]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-19 17:36:18,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


### Changes to `price` and `odometer` Columns

The `price` and `odometer` columns contain numeric values, but they are stored as text. To address this issue:  
1. Non-numeric characters will be removed.  
2. The data type will be converted to `int` for accurate numerical analysis.  
3. Column names will be updated to more descriptive terms for better clarity and usability.  

In [743]:
def data_cleaner(text_value):
    """Strip currency, separator, and unit characters so the value can be cast to int."""
    text_value = text_value.replace('$', '')
    text_value = text_value.replace('.', '')
    text_value = text_value.replace(',', '')
    text_value = text_value.replace('km', '')

    return text_value

In [744]:
price_cleaned = []

for c in autos["price"]:
    
    price_cleaned.append(data_cleaner(c))

In [745]:
autos["price"] = price_cleaned

In [746]:
autos["price"] = autos["price"].astype(int, copy=False)

In [747]:
autos.rename(columns={"price": "price_usd"}, inplace=True)

In [748]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [749]:
odometer_cleaned = []

for c in autos["odometer"]:
    
    odometer_cleaned.append(data_cleaner(c))

In [750]:
autos["odometer"] = odometer_cleaned

In [751]:
autos["odometer"] = autos["odometer"].astype(int, copy=False)


In [752]:
autos.rename(columns={"odometer": "odometer_km"}, inplace=True)
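
The same cleaning can be written without explicit loops by using pandas string methods. This is a minimal sketch, assuming every value is a string with at least one digit and no decimal part; the helper name `to_int` and the toy frame `raw` are made up for illustration.

```python
import pandas as pd

# Toy values in the same format as the raw columns.
raw = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

def to_int(column: pd.Series) -> pd.Series:
    """Drop every non-digit character, then cast the result to int."""
    return column.str.replace(r"\D", "", regex=True).astype(int)

cleaned = pd.DataFrame({
    "price_usd": to_int(raw["price"]),
    "odometer_km": to_int(raw["odometer"]),
})
print(cleaned)
```

Removing all non-digits is blunt: a value with a decimal part would be silently inflated, so this shortcut only suits columns that hold whole numbers.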

### Outlier Analysis

To ensure the dataset's integrity, the `price_usd` and `odometer_km` columns will be analyzed for outliers by examining their minimum and maximum values. This step will help identify any values that appear unrealistically high or low. Such outliers could distort the analysis and may need to be removed for more accurate results.  


In [753]:
autos["price_usd"].unique().shape

(2357,)

In [754]:
autos["price_usd"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

In [755]:
autos["price_usd"].value_counts().sort_index(ascending = False)

price_usd
99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
            ... 
5              2
3              1
2              3
1            156
0           1421
Name: count, Length: 2357, dtype: int64

In [756]:
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [757]:
autos["odometer_km"].value_counts().sort_index(ascending = False)

odometer_km
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: count, dtype: int64

The next cells apply the interquartile range (IQR) rule to find outliers and then remove them.

In [758]:
sorted_price = autos["price_usd"].sort_values(ascending = False)

In [759]:
print(sorted_price[0:5])

39705    99999999
42221    27322222
47598    12345678
39377    12345678
27371    12345678
Name: price_usd, dtype: int64


In [760]:
Q1_price = sorted_price.quantile(0.25)

Q3_price = sorted_price.quantile(0.75)

IQR_price = Q3_price - Q1_price

In [761]:
print(IQR_price)

6100.0


In [762]:
lower_bound_price = Q1_price - 1.5 * IQR_price
upper_bound_price = Q3_price + 1.5 * IQR_price

print(lower_bound_price)
print(upper_bound_price)

-8050.0
16350.0


In [763]:
autos = autos[autos["price_usd"].between(1, 16350)]

In [764]:
print(len(autos["price_usd"]))

44795


In [765]:
sorted_odometer = autos["odometer_km"].sort_values(ascending = False)

In [766]:
print(sorted_odometer[0:5])

49999    150000
0        150000
1        150000
49996    150000
49994    150000
Name: odometer_km, dtype: int64


In [767]:
Q1_odometer = sorted_odometer.quantile(0.25)

Q3_odometer = sorted_odometer.quantile(0.75)

IQR_odometer = Q3_odometer - Q1_odometer

In [768]:
print(IQR_odometer)

25000.0


In [769]:
lower_bound_odometer = Q1_odometer - 1.5 * IQR_odometer
upper_bound_odometer = Q3_odometer + 1.5 * IQR_odometer

print(lower_bound_odometer)
print(upper_bound_odometer)   

87500.0
187500.0


The IQR rule puts the lower bound at 87,500 km, but in the context of vehicles a reading below 87,500 km is entirely plausible: it simply means the car has been driven less. This suggests that the Interquartile Range (IQR)-based method of detecting outliers does not align with domain-specific knowledge about what constitutes an "outlier" for odometer readings.

The maximum value of `odometer_km` is 150,000 km, which is a realistic figure for a used vehicle and lies below the upper bound of 187,500 km. No rows are therefore removed on the basis of `odometer_km`.
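
The IQR rule used for both columns can be wrapped in a small reusable function. The sketch below is an illustration on toy data; the name `iqr_bounds` and the multiplier parameter `k` are assumptions introduced here, not part of the original notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

toy = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_bounds(toy)

# Keep only the values inside the fences (inclusive on both ends).
inliers = toy[toy.between(lower, upper)]
print(lower, upper)
print(inliers.tolist())
```

With the odometer quartiles used in this notebook (Q1 = 125,000 and Q3 = 150,000), the same formula gives the fences 87,500 and 187,500 printed earlier. The fences are a statistical convention only, so whether to act on them remains a domain judgment.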

### Exploring the date columns

In [770]:
# Work on an independent copy so the assignments below do not raise SettingWithCopyWarning
autos = autos.copy()
autos["date_crawled"] = pd.to_datetime(autos["date_crawled"], errors='coerce')
autos["ad_created"] = pd.to_datetime(autos["ad_created"], errors='coerce')
autos["last_seen"] = pd.to_datetime(autos["last_seen"], errors='coerce')

In [771]:
autos["date_crawled"].value_counts(normalize=True, dropna=False).sort_index()

date_crawled
2016-03-05 14:06:30    0.000022
2016-03-05 14:06:40    0.000022
2016-03-05 14:07:04    0.000022
2016-03-05 14:07:08    0.000022
2016-03-05 14:07:21    0.000022
                         ...   
2016-04-07 14:07:04    0.000022
2016-04-07 14:30:09    0.000022
2016-04-07 14:30:26    0.000022
2016-04-07 14:36:44    0.000022
2016-04-07 14:36:55    0.000022
Name: proportion, Length: 43358, dtype: float64

In [772]:
autos["ad_created"].value_counts(normalize=True, dropna=False).sort_index()

ad_created
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
2015-12-30    0.000022
                ...   
2016-04-03    0.038799
2016-04-04    0.036701
2016-04-05    0.011653
2016-04-06    0.003237
2016-04-07    0.001183
Name: proportion, Length: 73, dtype: float64

In [773]:
autos["last_seen"].value_counts(normalize=True, dropna=False).sort_index()

last_seen
2016-03-05 14:45:46    0.000022
2016-03-05 14:46:02    0.000022
2016-03-05 14:49:34    0.000022
2016-03-05 15:16:11    0.000022
2016-03-05 15:16:47    0.000022
                         ...   
2016-04-07 14:58:44    0.000067
2016-04-07 14:58:45    0.000022
2016-04-07 14:58:46    0.000022
2016-04-07 14:58:48    0.000067
2016-04-07 14:58:50    0.000067
Name: proportion, Length: 36473, dtype: float64

In [774]:
autos["registration_year"].describe()

count    44795.00000
mean      2003.97339
std         74.81955
min       1000.00000
25%       1999.00000
50%       2003.00000
75%       2007.00000
max       9999.00000
Name: registration_year, dtype: float64

### Issue with `registration_year`

The `registration_year` column contains values ranging from 1000 to 9999, which are not realistic for vehicle registration years. These anomalies indicate the presence of invalid data. To address this issue, we keep only registration years from 1900 to 2016: a car cannot have been first registered after its listing was crawled (March and April 2016), and registrations before 1900 are implausible. Entries outside this range are removed so that the remaining data is reliable for analysis.  


In [775]:
print(len(autos[autos["registration_year"].between(1900, 2016)]))

42959


In [776]:
autos_per_year = autos.groupby("registration_year")["registration_year"].count()

In [777]:
print(autos_per_year)

registration_year
1000    1
1001    1
1111    1
1800    2
1910    5
       ..
4800    1
5000    4
5911    1
8888    1
9999    2
Name: registration_year, Length: 85, dtype: int64


In [778]:
autos = autos[autos["registration_year"].between(1900, 2016)].copy()

In [779]:
autos["registration_year"].value_counts(normalize=True)

registration_year
2000    0.073140
1999    0.067227
2005    0.067227
2003    0.062129
2004    0.061920
          ...   
1929    0.000023
1952    0.000023
1941    0.000023
1938    0.000023
1953    0.000023
Name: proportion, Length: 70, dtype: float64

### Observations on `registration_year`

To address the unrealistic values in the `registration_year` column, rows were filtered to retain only those with registration years between 1900 and 2016. This range was chosen based on historical feasibility and the dataset's context, with 2016 being the year in which the listings were crawled. The filter reduced the dataset from 44,795 to 42,959 rows, and the remaining rows give a more accurate picture of vehicle registration years.  
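
Before dropping rows, it can help to check how much of the data a range filter removes. A minimal sketch on toy values follows; the names `valid` and `share_dropped` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"registration_year": [1000, 1910, 1999, 2016, 2017, 9999]})

# between() is inclusive on both ends by default.
valid = toy["registration_year"].between(1900, 2016)
share_dropped = 1 - valid.mean()

# .copy() makes the filtered frame independent of the original.
filtered = toy[valid].copy()
print(f"dropping {share_dropped:.1%} of rows")
print(filtered["registration_year"].tolist())
```

In the notebook itself the filter took the row count from 44,795 to 42,959, which is 1,836 rows or roughly 4.1% of the data.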


In [780]:
autos["brand"].value_counts(normalize=True) * 100

brand
volkswagen        21.476291
opel              11.520287
bmw               10.354059
mercedes_benz      8.813054
audi               7.756233
ford               7.309295
renault            5.076934
peugeot            3.217021
fiat               2.774739
seat               1.908797
skoda              1.699295
mazda              1.608510
nissan             1.578249
smart              1.536349
citroen            1.503759
toyota             1.336158
hyundai            1.054494
volvo              0.933448
mitsubishi         0.870598
sonstige_autos     0.849647
honda              0.833353
mini               0.775158
alfa_romeo         0.700668
kia                0.696012
suzuki             0.637817
chevrolet          0.574967
chrysler           0.374776
dacia              0.286320
daihatsu           0.272353
subaru             0.221141
jeep               0.200191
porsche            0.188552
saab               0.176913
daewoo             0.162946
trabant            0.151307
rover         

In [781]:
selected_brands = ["volkswagen", "opel", "bmw", "mercedes_benz", "audi", "ford", "renault"]
selected_brands_mean = {}

In [782]:
for brand in selected_brands:
    selected_brands_mean[brand] = autos[autos["brand"] == brand]["price_usd"].mean()

In [783]:
print(selected_brands_mean)

{'volkswagen': np.float64(4183.019618469542), 'opel': np.float64(2712.9959587795515), 'bmw': np.float64(5649.20256294964), 'mercedes_benz': np.float64(5259.88087691495), 'audi': np.float64(5703.895858343338), 'ford': np.float64(2944.803821656051), 'renault': np.float64(2281.4612563044475)}


In [784]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [785]:
selected_brands_mileage_mean = {}

In [786]:
for brand in selected_brands:
    selected_brands_mileage_mean[brand] = autos[autos["brand"] == brand]["odometer_km"].mean()

In [787]:
print(selected_brands_mileage_mean)

{'volkswagen': np.float64(132813.24517667462), 'opel': np.float64(130500.10103051121), 'bmw': np.float64(138833.1834532374), 'mercedes_benz': np.float64(138411.25198098258), 'audi': np.float64(139701.3805522209), 'ford': np.float64(127004.77707006369), 'renault': np.float64(128892.70976616231)}


In [788]:
sbm_series = pd.Series(selected_brands_mean)
print(sbm_series)

volkswagen       4183.019618
opel             2712.995959
bmw              5649.202563
mercedes_benz    5259.880877
audi             5703.895858
ford             2944.803822
renault          2281.461256
dtype: float64


In [789]:
sbmm_series = pd.Series(selected_brands_mileage_mean)
print(sbmm_series)

volkswagen       132813.245177
opel             130500.101031
bmw              138833.183453
mercedes_benz    138411.251981
audi             139701.380552
ford             127004.777070
renault          128892.709766
dtype: float64


In [790]:
price_and_mileage = pd.DataFrame(sbm_series, columns=['mean_price'])

In [791]:
price_and_mileage["mean_mileage"] = sbmm_series

In [792]:
price_and_mileage

Unnamed: 0,mean_price,mean_mileage
volkswagen,4183.019618,132813.245177
opel,2712.995959,130500.101031
bmw,5649.202563,138833.183453
mercedes_benz,5259.880877,138411.251981
audi,5703.895858,139701.380552
ford,2944.803822,127004.77707
renault,2281.461256,128892.709766


In [793]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price_usd,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01,0,39218,2016-04-01 14:38:50


In [794]:
print(autos["seller"].unique())
print(autos["offer_type"].unique())
print(autos["abtest"].unique())
print(autos["vehicle_type"].unique())
print(autos["gearbox"].unique())
print(autos["fuel_type"].unique())
print(autos["unrepaired_damage"].unique())

['privat' 'gewerblich']
['Angebot']
['control' 'test']
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio'
 'andere']
['manuell' 'automatik' nan]
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
['nein' nan 'ja']


In [795]:
translations = {
    "privat": "private",
    "gewerblich": "commercial",
    "Angebot": "offer",
    "control": "control",
    "test": "test",
    "bus": "bus",
    "limousine": "limousine",
    "kleinwagen": "small car",
    "kombi": "station wagon",
    "coupe": "coupe",
    "suv": "suv",
    "cabrio": "convertible",
    "andere": "other",
    "manuell": "manual",
    "automatik": "automatic",
    "lpg": "lpg",
    "benzin": "petrol",
    "diesel": "diesel",
    "cng": "cng",
    "hybrid": "hybrid",
    "elektro": "electric",
    "nein": "no",
    "ja": "yes"
}


In [796]:
autos['seller'] = autos['seller'].map(translations)
autos['offer_type'] = autos['offer_type'].map(translations)
autos['abtest'] = autos['abtest'].map(translations)
autos['vehicle_type'] = autos['vehicle_type'].map(translations)
autos['gearbox'] = autos['gearbox'].map(translations)
autos['fuel_type'] = autos['fuel_type'].map(translations)
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(translations)

In [797]:
print(autos["seller"].unique())
print(autos["offer_type"].unique())
print(autos["abtest"].unique())
print(autos["vehicle_type"].unique())
print(autos["gearbox"].unique())
print(autos["fuel_type"].unique())
print(autos["unrepaired_damage"].unique())

['private' 'commercial']
['offer']
['control' 'test']
['bus' 'limousine' 'small car' 'station wagon' nan 'coupe' 'suv'
 'convertible' 'other']
['manual' 'automatic' nan]
['lpg' 'petrol' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']
['no' nan 'yes']


In [799]:
autos["date_crawled"] = autos["date_crawled"].dt.strftime("%Y%m%d").astype(int)
autos["ad_created"] = autos["ad_created"].dt.strftime("%Y%m%d").astype(int)
autos["last_seen"] = autos["last_seen"].dt.strftime("%Y%m%d").astype(int)

In [801]:
print(autos["date_crawled"].head())
print(autos["ad_created"].head())
print(autos["last_seen"].head())

0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: date_crawled, dtype: int64
0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: ad_created, dtype: int64
0    20160406
1    20160406
2    20160406
3    20160315
4    20160401
Name: last_seen, dtype: int64


In [802]:
brand_model_combinations = autos['brand'] + ' ' + autos['model']

In [804]:
print(brand_model_combinations.value_counts().head())

volkswagen golf    3487
bmw 3er            2431
volkswagen polo    1604
opel corsa         1589
opel astra         1339
Name: count, dtype: int64


### Most Common Brand-Model Combinations

The analysis of the most common brand-model combinations in the dataset revealed the following results:

- Volkswagen Golf: 3,487 occurrences
- BMW 3er: 2,431 occurrences
- Volkswagen Polo: 1,604 occurrences
- Opel Corsa: 1,589 occurrences
- Opel Astra: 1,339 occurrences

These combinations represent the most frequently listed cars in the dataset, indicating their popularity in the used car market.
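
An alternative to concatenating the two columns is to group on both of them, which keeps brand and model as separate index levels. The sketch below uses made-up rows; it also counts rows with a missing `model`, since such rows are left out of the combination counts under either approach.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "bmw", "opel", "volkswagen"],
    "model": ["golf", "golf", "3er", np.nan, "polo"],
})

# groupby drops rows whose key is NaN by default, so the opel row is not counted.
combo_counts = (
    toy.groupby(["brand", "model"])
       .size()
       .sort_values(ascending=False)
)
missing_model = toy["model"].isna().sum()
print(combo_counts)
print("rows without a model:", missing_model)
```

Reporting the number of rows without a model alongside the ranking makes it clear how much of the data the ranking covers.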


In [813]:
mileage_bins = pd.cut(autos['odometer_km'], bins=[0, 50000, 100000, 150000], 
                       labels=['Low', 'Medium', 'High'])

In [814]:
print(mileage_bins)

0          High
1          High
2        Medium
3        Medium
4          High
          ...  
49993      High
49994      High
49996      High
49997       Low
49999      High
Name: odometer_km, Length: 42959, dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']


In [815]:
avg_prices_by_mileage = autos.groupby(mileage_bins, observed=False)['price_usd'].mean()

In [816]:
print(avg_prices_by_mileage)

odometer_km
Low       6782.210781
Medium    6308.536380
High      3505.859592
Name: price_usd, dtype: float64


### Relationship Between Mileage and Price

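The analysis shows that cars with lower mileage tend to be more expensive. The average prices by mileage category are as follows:

- Low mileage (up to 50,000 km): 6,782.21 USD
- Medium mileage (over 50,000 km, up to 100,000 km): 6,308.54 USD
- High mileage (over 100,000 km, up to 150,000 km): 3,505.86 USD

The gap between the low and medium groups is modest (about 7%), while the high-mileage group averages roughly 44% less than the medium group. These averages are computed after listings above 16,350 USD were removed, so they describe the filtered dataset and not the full market.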
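
The binning step is easier to reason about on a few hand-picked values. This sketch, built on toy data, shows that `pd.cut` intervals are closed on the right by default, so a reading of exactly 50,000 km falls in the "Low" group; the `stats` table and its choice of aggregates are illustrative additions.

```python
import pandas as pd

toy = pd.DataFrame({
    "odometer_km": [5000, 50000, 70000, 100000, 150000],
    "price_usd": [9000, 7000, 6000, 5000, 3000],
})

# Intervals are (0, 50000], (50000, 100000], (100000, 150000].
mileage_group = pd.cut(
    toy["odometer_km"],
    bins=[0, 50000, 100000, 150000],
    labels=["Low", "Medium", "High"],
)

# Mean, median, and group size give more context than the mean alone.
stats = toy.groupby(mileage_group, observed=False)["price_usd"].agg(["mean", "median", "count"])
print(stats)
```

Including the count per group is useful here because most listings in this dataset sit at 150,000 km, so the groups are far from equal in size.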


In [823]:
damaged_cars = autos[autos["unrepaired_damage"] == "yes"]["price_usd"]
non_damaged_cars = autos[autos["unrepaired_damage"] == "no"]["price_usd"]

In [824]:
print(damaged_cars.mean())

1972.4979937583594


In [826]:
print(non_damaged_cars.mean())

4851.269556338724


### Impact of Damage on Car Prices

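The analysis reveals that cars with unrepaired damage are listed at much lower prices than cars without it. The average prices for both categories are as follows:

- Cars with unrepaired damage: 1,972.50 USD
- Cars without unrepaired damage: 4,851.27 USD

On average, a car with unrepaired damage is listed at about 41% of the price of an undamaged one. Listings with no value in `unrepaired_damage` are excluded from both averages, and the comparison does not control for age or mileage, so the gap should be read as an association and not as the isolated effect of damage.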
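
The comparison can be extended to cover listings where the damage status is missing. A minimal sketch on toy data follows, assuming pandas 1.1 or later for the `dropna` argument of `groupby`; the frame `toy` and its values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "unrepaired_damage": ["yes", "no", np.nan, "no", "yes", np.nan],
    "price_usd": [1000, 5000, 2000, 6000, 2000, 3000],
})

# dropna=False keeps the rows with a missing damage status as their own group.
by_damage = (
    toy.groupby("unrepaired_damage", dropna=False)["price_usd"]
       .agg(["mean", "median", "count"])
)
print(by_damage)
```

Seeing the missing-status group next to the other two helps judge whether leaving it out changes the picture.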
