## EDA - Dataset 04 - vehicle_dataset

### 1. Packages & Settings

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Advanced analysis
from scipy import stats
from ydata_profiling import ProfileReport

# Interactive tables (optional)
from itables import init_notebook_mode, show
init_notebook_mode(all_interactive=True)

# Configuration
pd.set_option('display.max_columns', 30)
sns.set_theme(style='whitegrid')
%config InlineBackend.figure_format = 'retina'
np.random.seed(42)  # Reproducibility

### 2. Importing Data

In [2]:
# Data Loading
df = pd.read_csv(r"E:\Work\DataSciencePortfolio\0_Data\0.04_vehicle_dataset\0.04.1_Raw\car details v4.csv")

### 3.1 First Look

In [3]:
# Sample data
show(df.sample(5))  # Random rows to avoid bias

# Summary of the DataFrame
df.info()

# Basic Statistics of the DataFrame
df.describe()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2059 entries, 0 to 2058
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Make                2059 non-null   object 
 1   Model               2059 non-null   object 
 2   Price               2059 non-null   int64  
 3   Year                2059 non-null   int64  
 4   Kilometer           2059 non-null   int64  
 5   Fuel Type           2059 non-null   object 
 6   Transmission        2059 non-null   object 
 7   Location            2059 non-null   object 
 8   Color               2059 non-null   object 
 9   Owner               2059 non-null   object 
 10  Seller Type         2059 non-null   object 
 11  Engine              1979 non-null   object 
 12  Max Power           1979 non-null   object 
 13  Max Torque          1979 non-null   object 
 14  Drivetrain          1923 non-null   object 
 15  Length              1995 non-null   float64
 16  Width 



### 3.2 Results of First Look:
##### Observations:
- Small DataFrame: 2,059 rows x 20 columns.
- There are missing values in: Engine, Max Power, Max Torque, Drivetrain, Length, Width, Height, Seating Capacity, Fuel Tank Capacity.
- Data types load without errors, so no correction is strictly necessary. Year, Max Power, and Max Torque *can* be adjusted (Max Power and Max Torque are stored as `object`, i.e. strings).
##### Potential Problems:
- Large outliers in Price, probably extreme sports cars or classic cars.
- Columns could be renamed to make units and number formatting explicit.
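
The missing-value observation above can be quantified per column. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the toy frame only stands in for `df` (its values are made up for illustration).

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for df (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "Engine": ["x", None, "y", None],
    "Length": [3990.0, 4200.0, np.nan, 4300.0],
})

report = missing_summary(toy)
print(report)
```

In the notebook itself, `missing_summary(df)` would list the nine affected columns together with their share of missing rows, which helps decide between dropping and imputing.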

### 4.1 Automated Analysis with ydata_profiling

In [None]:
# profile = ProfileReport(df, title="Automated EDA", explorative=True)
# profile.to_notebook_iframe()
# Save to HTML for later review (optional)
# profile.to_file("5.1.04b_automated_eda_report.html")


### 4.2 Automated EDA Results
##### Observations
- Price and Kilometer have extreme outliers. **Check**
- Max Power and Max Torque have unclear units. They should be **cleaned** and **transformed**.
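
A possible starting point for cleaning Max Power and Max Torque is to pull the leading number out of each string. This is a sketch under an assumption: it presumes values shaped like `"87 bhp @ 6000 rpm"`, which the notebook output does not confirm. `leading_number` is a hypothetical helper, and the sample strings are made up.

```python
import re

import pandas as pd

# Matches an integer or decimal number at the start of the string
_LEADING_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)?)")


def leading_number(value):
    """Return the first number in a string as float, or NaN if there is none."""
    if pd.isna(value):
        return float("nan")
    match = _LEADING_NUMBER.match(str(value))
    return float(match.group(1)) if match else float("nan")


# Assumed string format; the real column may look different
toy = pd.DataFrame({
    "Max Power": ["87 bhp @ 6000 rpm", "118.4 bhp @ 4000 rpm", None, "unknown"],
})
toy["max_power_value"] = toy["Max Power"].map(leading_number)
print(toy)
```

The sketch does not check or convert units, so the distinct unit strings should be inspected before the extracted numbers are treated as comparable.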

##### Hypotheses:
- Price seems to correlate positively with car size, certain brands, and year, and negatively with kilometers driven.
- Drivetrain and Transmission seem to affect price, perhaps because they indicate more modern engineering?
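
The numeric part of these hypotheses could be screened with a rank correlation against Price. The sketch below uses Spearman rather than Pearson because Price has extreme outliers; that choice is a suggestion, not something the notebook prescribes. The toy values are invented and perfectly monotone on purpose.

```python
import pandas as pd

# Toy data standing in for df (illustrative values only)
toy = pd.DataFrame({
    "Price": [300000, 450000, 800000, 1200000, 2500000],
    "Year": [2012, 2014, 2016, 2018, 2021],
    "Kilometer": [90000, 75000, 60000, 40000, 15000],
    "Length": [3600.0, 3900.0, 4300.0, 4400.0, 4800.0],
})

# Spearman rank correlation of every numeric column with Price
corr_with_price = (
    toy.corr(method="spearman", numeric_only=True)["Price"]
    .drop("Price")
    .sort_values()
)
print(corr_with_price)
```

On the real data the coefficients will be weaker than in this toy frame. The sign and rough magnitude are what matter for deciding which hypotheses deserve a closer look.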

### 5. Necessary Cleaning & Transformation Steps for Python Scripts
##### Cleaning:
- Inspect and potentially clean missing values in Engine, Max Power, Max Torque, Drivetrain, Length, Width, Height, Seating Capacity, Fuel Tank Capacity.
- Optimize column naming; potentially clean/update columns and values in Year, Max Power, Max Torque.
##### Transformation:
- Filter and inspect Drivetrain and Transmission.
- Check the IQR of Price and look for further correlations with car size, brand, and year.
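
The IQR check on Price could look like the sketch below. `iqr_bounds` is a hypothetical helper, the multiplier `k=1.5` is the common Tukey convention rather than a value taken from the notebook, and the prices are made up.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices standing in for df["Price"] (illustrative values only)
prices = pd.Series(
    [300000, 450000, 500000, 650000, 700000, 35000000], name="Price"
)

lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]
print(f"fences: {lower:.0f} to {upper:.0f}")
print(outliers)
```

Flagged rows are candidates for review, not automatic removal: a very expensive car may be a legitimate listing rather than a data error.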

### 6. Hypothesis-Driven Analysis

In [5]:
# Check hypotheses in 4.2
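
One way to start on the Drivetrain and Transmission hypothesis from 4.2 is to compare group sizes and median prices per category. This is a sketch only: `price_by_group` is a hypothetical helper, and the category labels and prices in the toy frame are assumptions, not values read from the dataset.

```python
import pandas as pd

# Toy data standing in for df (labels and prices are illustrative)
toy = pd.DataFrame({
    "Transmission": ["Manual", "Manual", "Automatic", "Automatic", "Manual"],
    "Drivetrain": ["FWD", "FWD", "AWD", "RWD", None],
    "Price": [400000, 500000, 2500000, 1800000, 450000],
})


def price_by_group(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Group size and median Price per category, most expensive first."""
    return (
        frame.groupby(column, dropna=False)["Price"]  # keep missing as a group
        .agg(n="count", median_price="median")
        .sort_values("median_price", ascending=False)
    )


by_transmission = price_by_group(toy, "Transmission")
by_drivetrain = price_by_group(toy, "Drivetrain")
print(by_transmission)
print(by_drivetrain)
```

The median is used instead of the mean so that the price outliers noted earlier do not dominate the comparison, and `dropna=False` keeps the rows with missing Drivetrain visible as their own group.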

### 7. Focused Investigation
*When to use*:  
- Drill into subgroups (e.g., "Why do users aged 30-40 have higher churn?")  
- Export specific slices for stakeholder reviews  
*Industry Standard*: Never explore blindly – start with hypotheses from Sections 4-5.

In [6]:
# Check how hypotheses from Sections 3-5 can be explained by the data
"""
show(
    df.query("Income > 70000"),
    column_filters="footer",
    buttons=["copy", "csv"],
    scrollY="300px",
    classes="compact"
)
"""
