# Exploratory Data Analysis – Otomoto Used Car Listings (2000–2025)

This notebook contains an exploratory analysis of ~226,000 used car listings collected from otomoto.pl.  
The goal is to understand market structure and pricing factors, and to identify insights useful to car traders and buyers in Poland.

**Dataset**: Scraped in 2025 from otomoto.pl  
**Tools**: Python (pandas, matplotlib, seaborn)

👨‍💻 Author: Łukasz Pindus

### Note on Equipment Data

The dataset contains a separate column with car equipment information (e.g. navigation, leather seats, parking sensors).  
However, because of its high variability and low standardization, and given the exploratory nature of this analysis, the equipment data was excluded from the initial scope.

Future versions of this project may include:
- standardization of equipment categories,
- analysis of premium features and their impact on price,
- encoding of equipment items (via NLP or one-hot encoding) to estimate feature importance.

The current analysis focuses on structured numeric and categorical attributes only (brand, model, year, price, mileage, fuel, gearbox, etc.).
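
As a rough illustration of the one-hot encoding idea listed above, the sketch below assumes the equipment column holds one comma-separated string per listing. The column name `equipment`, the separator, and the sample values are hypothetical, since the actual format of that column is not shown in this notebook.

```python
import pandas as pd

# Hypothetical sample: the real column name and separator may differ.
sample = pd.DataFrame({
    "price": [50000, 82000, 31000],
    "equipment": ["navigation, leather seats", "parking sensors, navigation", None],
})

# Normalise case and whitespace around separators before splitting.
cleaned = (
    sample["equipment"]
    .fillna("")
    .str.lower()
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.strip()
)

# Expand each distinct item into its own 0/1 indicator column.
equipment_flags = cleaned.str.get_dummies(sep=",")

encoded = pd.concat([sample[["price"]], equipment_flags], axis=1)
print(encoded)
```

Listings without equipment information end up with zeros in every indicator column, which keeps them in the frame rather than dropping them.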

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine, text
# set_theme is the current name; sns.set is kept only as an alias for it.
sns.set_theme(style="whitegrid")

In [15]:
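# Connect to the local PostgreSQL database that stores the scraped listings.
engine = create_engine("postgresql+psycopg2://postgres:1234@localhost:5432/car_database")
# Wrap the raw SQL in text() so SQLAlchemy treats it as an explicit textual statement.
text_query = text("SELECT * FROM listings")

df = pd.read_sql(text_query, engine)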
df.head()

|  | id | local_id | make | model | version | color | number_of_doors | number_of_seats | production_year | generation | ... | range_km | engine_capacity_cm3 | power_hp | co2_emissions_gpkm | urban_fuel_consumption_l_per_100km | extraurban_fuel_consumption_l_per_100km | mileage_km | average_energy_consumption_kwh_per_100km | battery_health_percent | max_electric_power_hp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Ford | Kuga |  | Błękitny | 5.0 | 5.0 | 2019 | II (2012-) | ... |  |  | 120.0 | 143.0 | 7.8 | 5.3 | 107345 |  |  |  |
| 1 | 2 | 2 | Citroën | C5 Aircross | 2.0 BlueHDi C-Series EAT8 | Szary | 5.0 |  | 2019 |  | ... |  |  | 178.0 |  |  |  | 158000 |  |  |  |
| 2 | 3 | 3 | Mercedes-Benz | CLA | 220 4-Matic AMG Line 7G-DCT | Szary | 4.0 | 5.0 | 2019 | II (2019-) | ... |  |  | 190.0 | 149.0 | 8.9 | 5.2 | 81000 |  |  |  |
| 3 | 4 | 4 | Jaguar | XF | 20d Portfolio | Beżowy | 4.0 | 5.0 | 2019 | X260 (2015-) | ... |  |  | 180.0 | 135.0 | 6.4 | 4.4 | 80000 |  |  |  |
| 4 | 5 | 5 | Peugeot | 508 | 1.6 PureTech GT S&S EAT8 | Szary | 5.0 | 5.0 | 2019 | II (2018-) | ... |  |  | 225.0 | 119.0 | 6.7 | 4.4 | 120000 |  |  |  |
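
The connection string in the loading cell above embeds the database password directly in the notebook. One possible alternative, sketched here with only the standard library, is to assemble the URL from environment variables. The variable names (`CAR_DB_USER`, `CAR_DB_PASSWORD`, and so on) and the helper `build_db_url` are hypothetical and not part of the original project.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (hypothetical names)."""
    user = env.get("CAR_DB_USER", "postgres")
    # Percent-encode the password so characters like '@' or '/' do not break the URL.
    password = quote_plus(env.get("CAR_DB_PASSWORD", ""))
    host = env.get("CAR_DB_HOST", "localhost")
    port = env.get("CAR_DB_PORT", "5432")
    name = env.get("CAR_DB_NAME", "car_database")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# A plain dict stands in for os.environ here so the example runs anywhere.
example_url = build_db_url({"CAR_DB_PASSWORD": "p@ss/word"})
print(example_url)
```

The resulting string could then be passed to `create_engine` in place of the literal URL.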


## Basic Data Overview

In [11]:
print("Shape:", df.shape)
df.info()

Shape: (226088, 45)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226088 entries, 0 to 226087
Data columns (total 45 columns):
 #   Column                                    Non-Null Count   Dtype         
---  ------                                    --------------   -----         
 0   id                                        226088 non-null  int64         
 1   local_id                                  226088 non-null  int64         
 2   make                                      226086 non-null  object        
 3   model                                     226088 non-null  object        
 4   version                                   154783 non-null  object        
 5   color                                     226088 non-null  object        
 6   number_of_doors                           225263 non-null  float64       
 7   number_of_seats                           211394 non-null  float64       
 8   production_year                           226088 non-null  int64         


### Basic Statistics for Numeric Columns

In [12]:
df.describe()

|  | id | local_id | number_of_doors | number_of_seats | production_year | number_of_engines | number_of_batteries | price | advert_date | advert_id | battery_capacity_kwh | range_km | power_hp | co2_emissions_gpkm | urban_fuel_consumption_l_per_100km | extraurban_fuel_consumption_l_per_100km | mileage_km | average_energy_consumption_kwh_per_100km | battery_health_percent | max_electric_power_hp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 226088.0 | 226088.0 | 225263.0 | 211394.0 | 226088.0 | 2046.0 | 256.0 | 226088.0 | 174360 | 174360.0 | 2108.0 | 2214.0 | 225830.0 | 99023.0 | 113250.0 | 97887.0 | 226088.0 | 1603.0 | 140.0 | 2608.0 |
| mean | 113044.5 | 113044.5 | 4.736597 | 5.057901 | 2016.662968 | 1.205767 | 3.164062 | 90892.32 | 2025-03-31 05:03:41.988299776 | 6134372000.0 | 71.225142 | 470.36495 | 170.387894 | 156.088591 | 7.520987 | 5.011073 | 132339.9 | 17.154367 | 96.835714 | 146.95092 |
| min | 1.0 | 1.0 | 2.0 | 1.0 | 2000.0 | 1.0 | 1.0 | 400.0 | 2024-07-24 22:16:00 | 6012393000.0 | 1.0 | 2.0 | 1.0 | 0.15 | 2.0 | 2.0 | 1.0 | 0.14 | 12.0 | 3.0 |
| 25% | 56522.75 | 56522.75 | 5.0 | 5.0 | 2013.0 | 1.0 | 1.0 | 32800.0 | 2025-03-27 12:40:00 | 6133941000.0 | 50.0 | 320.0 | 116.0 | 120.0 | 5.7 | 4.0 | 56200.0 | 15.0 | 96.0 | 72.0 |
| 50% | 113044.5 | 113044.5 | 5.0 | 5.0 | 2018.0 | 1.0 | 1.0 | 63200.0 | 2025-04-05 17:54:00 | 6135922000.0 | 70.0 | 420.0 | 150.0 | 139.0 | 7.0 | 4.9 | 125594.0 | 16.2 | 99.0 | 113.0 |
| 75% | 169566.25 | 169566.25 | 5.0 | 5.0 | 2021.0 | 1.0 | 1.0 | 112900.0 | 2025-04-10 18:57:00 | 6136443000.0 | 82.0 | 503.0 | 190.0 | 162.0 | 8.5 | 5.6 | 193147.8 | 18.0 | 100.0 | 170.0 |
| max | 226088.0 | 226088.0 | 6.0 | 9.0 | 2025.0 | 4.0 | 96.0 | 3500000.0 | 2025-04-15 22:58:00 | 6136859000.0 | 1078.0 | 134000.0 | 2737.0 | 153144.0 | 34.5 | 45.0 | 2066593.0 | 195.0 | 100.0 | 1020.0 |
| std | 65266.128168 | 65266.128168 | 0.652953 | 0.729591 | 5.523754 | 0.419787 | 12.033468 | 107491.1 |  | 3938532.0 | 61.720656 | 2842.027817 | 89.628012 | 1113.24873 | 2.839097 | 1.372863 | 91784.7 | 8.711312 | 9.353652 | 122.287361 |
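
Several maxima in the table above sit far from the upper quartile, for example `range_km` at 134000 against a 75th percentile of 503, or `co2_emissions_gpkm` at 153144 against 162. Such values may be data-entry errors and are worth checking before any price analysis. A minimal sketch of one way to flag them uses the common 1.5 × IQR rule on a small made-up sample. The helper name `iqr_outlier_mask` and the threshold are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up values for illustration only; the last one mimics an implausible entry.
sample = pd.DataFrame({"range_km": [320.0, 420.0, 450.0, 503.0, 134000.0]})
sample["range_outlier"] = iqr_outlier_mask(sample["range_km"])
print(sample)
```

Flagging rows rather than deleting them keeps the decision about each suspicious value explicit and reversible.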


### Null Values (Top 10 columns)

In [16]:
df.isnull().sum().sort_values(ascending=False).head(10)

registration_number                         226088
price_level                                 226088
engine_capacity_cm3                         226088
first_registration_date                     226088
battery_health_percent                      225948
number_of_batteries                         225832
charging_connector_type                     225706
average_energy_consumption_kwh_per_100km    224485
number_of_engines                           224042
battery_capacity_kwh                        223980
dtype: int64
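
Four of the columns above have 226088 missing values, which equals the row count, so they are entirely empty. A minimal sketch of how missing-value shares could be computed and fully empty columns dropped is shown below on a small made-up frame rather than the real `df`.

```python
import pandas as pd

# Made-up frame standing in for the listings table.
sample = pd.DataFrame({
    "make": ["Ford", "Jaguar", None, "Peugeot"],
    "price": [50000, 82000, 31000, 64000],
    "registration_number": [None, None, None, None],
})

# Share of missing values per column, as a percentage.
null_share = sample.isnull().mean().mul(100).sort_values(ascending=False)
print(null_share)

# Drop columns in which every value is missing.
trimmed = sample.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

Expressing nulls as a percentage makes sparsely populated columns, such as the electric-vehicle fields, easier to compare than raw counts.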