# Car Price Prediction Dataset Exploration

## 1. Dataset Introduction

This dataset contains 2,500 used-car listings with attributes such as brand, model, year, mileage, engine size, fuel type, transmission, condition, and price.  
It is structured and readable, with four numeric features and five categorical ones (plus an ID column), which makes it well suited to exploratory data analysis.

## 2. Conclusion

After reviewing this dataset, I decided **to select** the Car Price dataset for my final project.  
It offers strong analytical potential, realistic numerical variables, clean categorical fields, and clear relationships between features such as mileage, year, engine size, and price.  
Compared with the other datasets I explored, this one provides the richest structure for meaningful EDA and visualizations.

## 3. Setup and Initial Inspection

In [1]:
import pandas as pd

df = pd.read_csv("datasets/car_price_prediction_.csv")
df.head()

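|   | Car ID | Brand | Year | Engine Size | Fuel Type | Transmission | Mileage | Condition | Price | Model |
|---|--------|-------|------|-------------|-----------|--------------|---------|-----------|-------|-------|
| 0 | 1 | Tesla | 2016 | 2.3 | Petrol | Manual | 114832 | New | 26613.92 | Model X |
| 1 | 2 | BMW | 2018 | 4.4 | Electric | Manual | 143190 | Used | 14679.61 | 5 Series |
| 2 | 3 | Audi | 2013 | 4.5 | Electric | Manual | 181601 | New | 44402.61 | A4 |
| 3 | 4 | Tesla | 2011 | 4.1 | Diesel | Automatic | 68682 | New | 86374.33 | Model Y |
| 4 | 5 | Ford | 2009 | 2.6 | Diesel | Manual | 223009 | Like New | 73577.1 | Mustang |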


### Summary of Initial Observations from `head()`

From the first few rows, the dataset looks realistic and well-structured.  
Columns such as **brand**, **model**, **year**, **mileage**, **engine size**, **fuel type**, **condition**, and **price** are clearly defined and intuitive.

**Core analytical columns likely to be useful:**
- `Price`, `Mileage`, `Year`, `Engine Size`, `Condition`, `Brand`

**Notable observations from the head():**
- `Engine Size` appears numeric and clean, but some datasets store it as text (e.g., “2.0L”), so this will need verification.
- `Model` contains many distinct entries and may require grouping for analysis.
- `Brand` and `Fuel Type` look properly categorized with no visible typos in the sample.
- `Price` and `Mileage` appear realistic and suitable for trend and correlation analysis.
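- Individual values fall in plausible ranges, unlike the AI Jobs dataset where inconsistencies appeared immediately. Some combinations still need checking: row 0 is a Tesla listed as `Petrol` with a `Manual` transmission, and rows 0 and 2 are marked `New` despite mileage above 100,000.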
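
The verification steps in this list can be sketched in pandas. This is a minimal example, not the notebook's actual code: the `"2.0L"` entry is a hypothetical value added only to show what the text check catches, `coerce_engine_size` is an illustrative helper name, and the 100,000 mileage threshold is an arbitrary assumption. The small `head` frame copies the five rows displayed above.

```python
import pandas as pd

# Part 1: detect numbers stored as text. The "2.0L" entry is hypothetical
# and is included only to show what the check would catch.
engine_raw = pd.Series([2.3, 4.4, 4.5, "2.0L"], name="Engine Size")
direct = pd.to_numeric(engine_raw, errors="coerce")
text_entries = engine_raw[direct.isna()]  # values that are not plain numbers


def coerce_engine_size(series: pd.Series) -> pd.Series:
    """Hypothetical helper: drop a trailing 'L' and convert to float."""
    cleaned = series.astype(str).str.strip().str.replace(r"[lL]$", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


engine_clean = coerce_engine_size(engine_raw)

# Part 2: cross-field plausibility on the five head() rows.
head = pd.DataFrame({
    "Brand": ["Tesla", "BMW", "Audi", "Tesla", "Ford"],
    "Fuel Type": ["Petrol", "Electric", "Electric", "Diesel", "Diesel"],
    "Mileage": [114832, 143190, 181601, 68682, 223009],
    "Condition": ["New", "Used", "New", "New", "Like New"],
})

# Teslas whose fuel type is anything other than Electric.
tesla_non_electric = head[(head["Brand"] == "Tesla") & (head["Fuel Type"] != "Electric")]

# "New" cars with high mileage; the threshold is an arbitrary illustration.
new_high_mileage = head[(head["Condition"] == "New") & (head["Mileage"] > 100_000)]
```

On the five `head()` rows, each filter returns two rows (indices 0 and 3 for the Tesla check, 0 and 2 for the mileage check), so both are worth counting on the full DataFrame.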

### Initial Potential Analysis Directions
1. How mileage and manufacturing year influence car price.  
2. Price differences and depreciation patterns across brands.  
3. How engine size and vehicle condition relate to price.


## 4. Basic Structure

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Car ID        2500 non-null   int64  
 1   Brand         2500 non-null   object 
 2   Year          2500 non-null   int64  
 3   Engine Size   2500 non-null   float64
 4   Fuel Type     2500 non-null   object 
 5   Transmission  2500 non-null   object 
 6   Mileage       2500 non-null   int64  
 7   Condition     2500 non-null   object 
 8   Price         2500 non-null   float64
 9   Model         2500 non-null   object 
dtypes: float64(2), int64(3), object(5)
memory usage: 195.4+ KB


### Summary from info()

The `info()` output shows that the dataset contains **2,500 rows and 10 columns**, with a good mix of numeric and categorical fields.  

**Missing values:**  
- No missing values were found in any column.  
- This greatly simplifies cleaning and ensures consistency for analysis.

**Data types:**  
- Numerical columns (`Year`, `Engine Size`, `Mileage`, `Price`) are correctly stored as numeric types.  
- Categorical fields (`Brand`, `Fuel Type`, `Transmission`, `Condition`, `Model`) are stored as objects, which is appropriate.  
- Overall, the data types look clean and suitable for EDA.

**General impression:**  
- The dataset is well-structured and requires minimal preprocessing.
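
`info()` only counts `NaN` values, so a few extra checks help before calling the data clean. The sketch below runs them on a small stand-in frame built from the `head()` rows; on the real data the same calls would be applied to `df`. The choice of checks (duplicate rows, repeated IDs, blank strings) is an assumption about what is worth verifying, not something the dataset documentation prescribes.

```python
import pandas as pd

# Small stand-in for df, built from the head() rows shown earlier.
sample = pd.DataFrame({
    "Car ID": [1, 2, 3, 4, 5],
    "Brand": ["Tesla", "BMW", "Audi", "Tesla", "Ford"],
    "Year": [2016, 2018, 2013, 2011, 2009],
    "Mileage": [114832, 143190, 181601, 68682, 223009],
    "Price": [26613.92, 14679.61, 44402.61, 86374.33, 73577.10],
})

# NaN count per column (what info() reports indirectly).
missing_per_column = sample.isna().sum()

# Fully identical rows.
duplicate_rows = int(sample.duplicated().sum())

# An identifier column should never repeat.
ids_unique = sample["Car ID"].is_unique

# Blank or whitespace-only strings are not NaN, so they need their own check.
blank_brands = int(sample["Brand"].str.strip().eq("").sum())
```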

In [3]:
df.describe()


|       | Car ID | Year | Engine Size | Mileage | Price |
|-------|--------|------|-------------|---------|-------|
| count | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 |
| mean | 1250.5 | 2011.6268 | 3.46524 | 149749.8448 | 52638.022532 |
| std | 721.83216 | 6.9917 | 1.432053 | 87919.952034 | 27295.833455 |
| min | 1.0 | 2000.0 | 1.0 | 15.0 | 5011.27 |
| 25% | 625.75 | 2005.0 | 2.2 | 71831.5 | 28908.485 |
| 50% | 1250.5 | 2012.0 | 3.4 | 149085.0 | 53485.24 |
| 75% | 1875.25 | 2018.0 | 4.7 | 225990.5 | 75838.5325 |
| max | 2500.0 | 2023.0 | 6.0 | 299967.0 | 99982.59 |


### Summary Statistics (`describe()`)

The `describe()` output provides insight into the scale and distribution of key numeric columns.

**Year**
- Min: 2000, Max: 2023 — realistic production years.
- Median: 2012, so about half of the cars were built in 2012 or later, while the oldest quarter dates from 2005 or earlier.
- Distribution appears balanced with no extreme outliers.

**Engine Size**
- Range: 1.0L to 6.0L — fits common real-world engine sizes.
- Middle 50% (Q1–Q3): ~2.2L to 4.7L — reasonable for typical vehicles.

**Mileage**
- Min: 15, Max: 299,967. The dataset does not state the unit; km is assumed here.
- Median ≈ 149,000 km, plausible for used cars.
- The interquartile range (about 71,800 to 226,000 km) covers both lightly and heavily used vehicles.

**Price**
- Range: 5,011 to 99,982 — plausible for used cars.
- Median ≈ 53,485 — suggests a mix of economy and premium vehicles.
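- No obvious anomalies, though the mean (≈ 52,638) and median both sit close to the midpoint of the range, which hints at a near-uniform spread; a histogram would show whether this resembles real market data.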

**Overall:**  
The numeric features show realistic variation, appropriate ranges, and no obvious anomalies.
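
The "no extreme outliers" reading can be checked directly from the `describe()` numbers with Tukey's 1.5 × IQR rule. The rule and the `k = 1.5` multiplier are a common convention chosen here for illustration, not something the dataset specifies, and `tukey_fences` is a hypothetical helper name.

```python
# Quartiles and extremes copied from the describe() output above.
stats = {
    "Year":        {"min": 2000.0,  "q1": 2005.0,    "q3": 2018.0,     "max": 2023.0},
    "Engine Size": {"min": 1.0,     "q1": 2.2,       "q3": 4.7,        "max": 6.0},
    "Mileage":     {"min": 15.0,    "q1": 71831.5,   "q3": 225990.5,   "max": 299967.0},
    "Price":       {"min": 5011.27, "q1": 28908.485, "q3": 75838.5325, "max": 99982.59},
}


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# A column contains outliers under this rule if its min or max lies outside the fences.
has_outliers = {}
for column, s in stats.items():
    lower, upper = tukey_fences(s["q1"], s["q3"])
    has_outliers[column] = s["min"] < lower or s["max"] > upper
```

For `Year`, for example, the fences are 1985.5 and 2037.5, and the observed range of 2000 to 2023 lies inside them. The same holds for the other three columns, so none is flagged under this rule.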


### 4.1 Categorical Column Distribution

In [4]:
cat_cols = df.select_dtypes(include='object').columns

for c in cat_cols:
    print(f"\n=== {c} (unique={df[c].nunique()}) ===")
    print(df[c].value_counts().head(10))



=== Brand (unique=7) ===
Brand
Toyota      374
Audi        368
BMW         358
Mercedes    353
Honda       352
Tesla       348
Ford        347
Name: count, dtype: int64

=== Fuel Type (unique=4) ===
Fuel Type
Diesel      655
Petrol      630
Electric    614
Hybrid      601
Name: count, dtype: int64

=== Transmission (unique=2) ===
Transmission
Manual       1308
Automatic    1192
Name: count, dtype: int64

=== Condition (unique=3) ===
Condition
Used        855
Like New    836
New         809
Name: count, dtype: int64

=== Model (unique=28) ===
Model
Fiesta      103
Corolla     103
A3           98
A4           96
Q7           95
CR-V         95
5 Series     93
3 Series     93
Prius        93
Model X      93
Name: count, dtype: int64


### Categorical Column Review (`value_counts()`)

A quick scan of categorical columns shows:

**Brand**
- Seven well-known manufacturers (e.g., Tesla, BMW, Ford), each with 347 to 374 listings.
- No visible spelling inconsistencies.

**Fuel Type**
- Four standard categories: Petrol, Diesel, Electric, and Hybrid.
- Counts are nearly even, from 601 to 655 per category.

**Transmission**
- Exactly two values: 'Manual' (1,308) and 'Automatic' (1,192).
- No unexpected categories.

**Condition**
- Values include 'New', 'Used', 'Like New'.
- Clean and interpretable.

**Model**
- 28 distinct model names; the ten most frequent each appear 93 to 103 times.
- With this many levels, the column may need grouping (for example by `Brand`) for comparative analysis.

**Conclusion:**  
Categorical fields appear clean, consistently formatted, and free from obvious errors.
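
Visual inspection of `value_counts()` can miss variants that differ only in case or surrounding whitespace. The sketch below shows one way to test for them and to confirm that each model maps to a single brand. The messy entries (`"bmw"`, `"Audi "`) are invented for the demonstration and were not observed in this dataset; `hidden_variants` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical messy sample: the lowercase and trailing-space entries are
# invented to show what the check catches.
sample = pd.DataFrame({
    "Brand": ["Tesla", "BMW", "bmw", "Audi ", "Ford"],
    "Model": ["Model X", "5 Series", "5 Series", "A4", "Mustang"],
})


def hidden_variants(series: pd.Series) -> int:
    """Count categories that collapse once whitespace and case are normalized."""
    normalized = series.str.strip().str.casefold()
    return int(series.nunique() - normalized.nunique())


brand_variants = hidden_variants(sample["Brand"])  # "BMW" and "bmw" collapse
model_variants = hidden_variants(sample["Model"])

# Each model should belong to exactly one (normalized) brand.
brand_key = sample["Brand"].str.strip().str.casefold()
brands_per_model = sample.assign(brand_key=brand_key).groupby("Model")["brand_key"].nunique()
ambiguous_models = brands_per_model[brands_per_model > 1]
```

A result of zero from `hidden_variants` on every object column of the real `df` would back up the conclusion above.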


## 5. Confirmed Analysis Directions

Based on the dataset structure, variable quality, and the patterns observed during exploration, the following analysis directions are the most suitable:

1. **Identify the key factors that influence car price**, focusing on mileage, year, engine size, brand, and condition.  
2. **Compare price patterns across different brands**, examining which brands retain value better.  
3. **Analyze depreciation effects**, especially how mileage and manufacturing year jointly affect price.

These directions are realistic, well-supported by the dataset, and provide clear pathways for meaningful visualizations.
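
As a starting point for these directions, the sketch below derives a vehicle age, correlates the numeric features with price, and compares median price by brand. It runs on the five `head()` rows only, so the numbers it produces are not meaningful; the point is the shape of the computation. `REFERENCE_YEAR` and the `Age` column are assumptions introduced here (2023 is the latest `Year` in `describe()`).

```python
import pandas as pd

# Stand-in for df, copied from the head() rows; replace with the full DataFrame.
sample = pd.DataFrame({
    "Brand": ["Tesla", "BMW", "Audi", "Tesla", "Ford"],
    "Year": [2016, 2018, 2013, 2011, 2009],
    "Engine Size": [2.3, 4.4, 4.5, 4.1, 2.6],
    "Mileage": [114832, 143190, 181601, 68682, 223009],
    "Price": [26613.92, 14679.61, 44402.61, 86374.33, 73577.10],
})

REFERENCE_YEAR = 2023  # assumption: latest production year in the data
sample["Age"] = REFERENCE_YEAR - sample["Year"]

# Directions 1 and 3: linear association of each numeric feature with price.
corr_with_price = (
    sample[["Age", "Mileage", "Engine Size", "Price"]]
    .corr()["Price"]
    .drop("Price")
)

# Direction 2: typical price per brand, highest first.
median_price_by_brand = (
    sample.groupby("Brand")["Price"].median().sort_values(ascending=False)
)
```

The median is used instead of the mean so that a few expensive listings do not dominate a brand's figure.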


## 6. Strengths and Weaknesses

**Strengths**
- Contains real-world automotive attributes with realistic value ranges.
- No missing values, making cleaning fast and straightforward.
- Columns cover multiple dimensions: price, performance, usage, and brand.

**Weaknesses**
- `Model` has 28 distinct values, so per-model comparisons need aggregation.
- `Engine Size`, `Mileage`, and `Price` carry no units; liters are assumed for engine size from its 1.0 to 6.0 range, and the values themselves are clean numerics.
- Categorical distributions may require grouping for meaningful visualization.
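
One common way to tame a many-level column such as `Model` is to keep the most frequent categories and lump the rest into a single label. The helper below is a minimal sketch of that idea; `lump_rare`, the `"Other"` label, and the toy counts are all assumptions for illustration.

```python
import pandas as pd


def lump_rare(series: pd.Series, top_n: int, other: str = "Other") -> pd.Series:
    """Keep the top_n most frequent categories and replace the rest with `other`."""
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), other)


# Toy sample with invented counts, not the real distribution.
models = pd.Series(
    ["Fiesta"] * 3 + ["Corolla"] * 3 + ["A3"] * 2 + ["Mustang", "Model Y"],
    name="Model",
)

lumped = lump_rare(models, top_n=3)
```

Because the real model counts are close to each other (93 to 103 among the top ten), a frequency cutoff would be somewhat arbitrary here; aggregating models under `Brand` may be the more natural grouping.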

