# **1. Introduction**
---

**Objective**

The objective of this notebook is to perform an **Exploratory Data Analysis (EDA)** on the car price prediction dataset to uncover insights, understand relationships between features, and identify any data issues like missing values or outliers. These insights will help guide the next steps in feature engineering, where we will transform and optimize features to improve the model's predictive performance.

The work is split across two notebooks:
1. **EDA** (this notebook) focuses on visualizing and summarizing data patterns and relationships without modifying the raw data.
2. **Feature Engineering** (a separate, subsequent notebook) applies data transformations, creates new features, and performs feature selection based on the EDA findings to improve model training.


### **Dataset Information**

**Context**

With the rise in the variety of cars offering diverse capabilities and features—such as model, production year, brand, fuel type, engine volume, mileage, and many more—the car market has become highly competitive. Buyers are looking for the best features available within their budget, making car price prediction a valuable tool. This challenge leverages a dataset of 19,237 entries for training and 8,245 entries for testing to predict car prices based on their features.

**Data Description**

- **Train.csv**: Contains 19,237 rows and 18 columns, including the target variable, `Price`.
- **Test.csv**: Contains 8,245 rows and 17 columns, without the target variable.

**Attributes**

1. **ID**: Unique identifier for each car entry.
2. **Price**: Target variable representing the price of the car (present only in the training dataset).
3. **Levy**: Additional fee or tax, which might affect the car's overall cost.
4. **Manufacturer**: Brand of the car, such as Toyota, BMW, Ford, etc.
5. **Model**: Specific model name within the manufacturer's lineup.
6. **Prod. year**: Year of production, indicating the car’s age and likely affecting its depreciation.
7. **Category**: Car type or category, such as sedan, SUV, truck, etc.
8. **Leather interior**: Indicates whether the car has leather seats, a feature that might influence buyer preference and price.
9. **Fuel type**: Type of fuel the car uses (e.g., petrol, diesel, electric), which may impact running costs and buyer choice.
10. **Engine volume**: Size of the engine, typically measured in liters, affecting power and fuel efficiency.
11. **Mileage**: Distance the car has traveled, which is often linked to wear and tear.
12. **Cylinders**: Number of engine cylinders, affecting power and performance.
13. **Gear box type**: Type of transmission (e.g., automatic, manual), a factor in driver preference.
14. **Drive wheels**: Describes the drivetrain type (e.g., FWD, RWD, AWD), which impacts handling and traction.
15. **Doors**: Number of doors, which may affect car usability and buyer preference.
16. **Wheel**: Indicates the car’s steering side (e.g., left or right), possibly relevant to regional usage.
17. **Color**: Exterior color of the car, which might impact buyer appeal.
18. **Airbags**: Number of airbags, reflecting the car’s safety features.

This dataset provides a comprehensive view of each car's attributes, allowing for an in-depth exploration of the factors that influence car prices.

## **Load and inspect data**

In [1]:
# Import necessary libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the training data into a DataFrame
df = pd.read_csv('../data/raw_train.csv')

# Display first 5 rows
df.head()

| | ID | Price | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Gear box type | Drive wheels | Doors | Wheel | Color | Airbags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45654403 | 13328 | 1399 | LEXUS | RX 450 | 2010 | Jeep | Yes | Hybrid | 3.5 | 186005 km | 6.0 | Automatic | 4x4 | 04-May | Left wheel | Silver | 12 |
| 1 | 44731507 | 16621 | 1018 | CHEVROLET | Equinox | 2011 | Jeep | No | Petrol | 3.0 | 192000 km | 6.0 | Tiptronic | 4x4 | 04-May | Left wheel | Black | 8 |
| 2 | 45774419 | 8467 | - | HONDA | FIT | 2006 | Hatchback | No | Petrol | 1.3 | 200000 km | 4.0 | Variator | Front | 04-May | Right-hand drive | Black | 2 |
| 3 | 45769185 | 3607 | 862 | FORD | Escape | 2011 | Jeep | Yes | Hybrid | 2.5 | 168966 km | 4.0 | Automatic | 4x4 | 04-May | Left wheel | White | 0 |
| 4 | 45809263 | 11726 | 446 | HONDA | FIT | 2014 | Hatchback | Yes | Petrol | 1.3 | 91901 km | 4.0 | Automatic | Front | 04-May | Left wheel | Silver | 4 |


In [None]:
# Dataset Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19237 entries, 0 to 19236
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                19237 non-null  int64  
 1   Price             19237 non-null  int64  
 2   Levy              19237 non-null  object 
 3   Manufacturer      19237 non-null  object 
 4   Model             19237 non-null  object 
 5   Prod. year        19237 non-null  int64  
 6   Category          19237 non-null  object 
 7   Leather interior  19237 non-null  object 
 8   Fuel type         19237 non-null  object 
 9   Engine volume     19237 non-null  object 
 10  Mileage           19237 non-null  object 
 11  Cylinders         19237 non-null  float64
 12  Gear box type     19237 non-null  object 
 13  Drive wheels      19237 non-null  object 
 14  Doors             19237 non-null  object 
 15  Wheel             19237 non-null  object 
 16  Color             19237 non-null  object 
 17  Airbags           19237 non-null  int64  
dtypes: float64(1), int64(4), object(13)

In [None]:
# Checking statistical summary
df.describe()

| | ID | Price | Prod. year | Cylinders | Airbags |
|---|---|---|---|---|---|
| count | 19237.0 | 19237.0 | 19237.0 | 19237.0 | 19237.0 |
| mean | 45576540.0 | 18555.93 | 2010.912824 | 4.582991 | 6.582627 |
| std | 936591.4 | 190581.3 | 5.668673 | 1.199933 | 4.320168 |
| min | 20746880.0 | 1.0 | 1939.0 | 1.0 | 0.0 |
| 25% | 45698370.0 | 5331.0 | 2009.0 | 4.0 | 4.0 |
| 50% | 45772310.0 | 13172.0 | 2012.0 | 4.0 | 6.0 |
| 75% | 45802040.0 | 22075.0 | 2015.0 | 4.0 | 12.0 |
| max | 45816650.0 | 26307500.0 | 2020.0 | 16.0 | 16.0 |
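
The summary above shows `Price` ranging from 1 to 26,307,500 while the 75th percentile is only 22,075, which hints at outliers at both ends. One common way to flag them is the 1.5 × IQR rule. The sketch below is an illustration under that assumption, not a method prescribed by this notebook; `iqr_outlier_mask` is a hypothetical helper, and the toy series reuses the five preview prices plus the reported maximum.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy data: the five prices from the preview plus the maximum from describe().
toy_prices = pd.Series([13328, 16621, 8467, 3607, 11726, 26307500], name="Price")
mask = iqr_outlier_mask(toy_prices)

# Show only the flagged values.
print(toy_prices[mask])
```

On the real data the same call would be `iqr_outlier_mask(df["Price"])`; the multiplier `k` is a tuning choice and may need adjusting for a heavily skewed price distribution.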


In [None]:
# Check for null values in training dataset
df.isnull().sum()

ID                  0
Price               0
Levy                0
Manufacturer        0
Model               0
Prod. year          0
Category            0
Leather interior    0
Fuel type           0
Engine volume       0
Mileage             0
Cylinders           0
Gear box type       0
Drive wheels        0
Doors               0
Wheel               0
Color               0
Airbags             0
dtype: int64
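
`isnull()` only counts true missing values (`NaN`/`None`). The preview shows a `-` in the `Levy` column of row 2, which is an ordinary string and is therefore not counted. The sketch below counts such placeholder strings in the text columns. The placeholder list is an assumption (only `-` is visible in the preview), and `count_placeholders` is a hypothetical helper.

```python
import pandas as pd

# Toy frame built from two columns of the preview rows.
toy = pd.DataFrame({
    "Levy": ["1399", "1018", "-", "862", "446"],
    "Color": ["Silver", "Black", "Black", "White", "Silver"],
})

# Assumed placeholder strings; only "-" actually appears in the preview.
PLACEHOLDERS = ["-", "", "?", "N/A"]

def count_placeholders(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder strings in each non-numeric column."""
    counts = {}
    for name in frame.columns:
        col = frame[name]
        if pd.api.types.is_numeric_dtype(col):
            continue  # numeric columns cannot hold string placeholders
        counts[name] = int(col.astype(str).str.strip().isin(PLACEHOLDERS).sum())
    return pd.Series(counts, dtype="int64")

print(toy.isnull().sum())       # reports no missing values
print(count_placeholders(toy))  # counts the "-" entry in Levy
```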


In [None]:
# Checking for duplicate values
df.duplicated().sum()

np.int64(313)
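
`df.duplicated()` flags a row only when every column, including `ID`, matches an earlier row, and it does not flag the first occurrence. The sketch below, on toy data, shows how to list every copy with `keep=False` and how the count changes when the identifier is left out of the comparison. Whether `ID` should be excluded is an assumption left to the analyst.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Price": [100, 100, 200, 200],
    "Model": ["FIT", "FIT", "Escape", "Escape"],
})

# Default: all columns must match; the first occurrence is not flagged.
full_dupes = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, handy for inspection.
all_copies = toy[toy.duplicated(keep=False)]

# Comparing only the feature columns can reveal listings repeated under new IDs.
feature_cols = [c for c in toy.columns if c != "ID"]
dupes_ignoring_id = int(toy.duplicated(subset=feature_cols).sum())

print(full_dupes, len(all_copies), dupes_ignoring_id)
```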

### **Observations**

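1. The dataset contains a total of **18 columns**, out of which **13 are stored as `object`** (treated here as categorical) and **5 are numerical**. Some `object` columns, such as **"Levy"** and **"Mileage"**, hold numeric information stored as text.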
2. The **column names** are not in a standard format (e.g., snake_case) and need to be standardized.
3. The column **"ID"** should be renamed to **"id"** for consistency with the snake_case convention.
4. The **"Levy"** column should be numeric, but it is stored as **object** because it contains the placeholder **"-"** (see row 2 of the preview), so it needs cleaning before conversion.
5. The **"Prod. year"** column should be renamed to **"manufacturing_year"** and converted to a **DateTime** object for better clarity and analysis.
6. The **"Mileage"** column should be renamed to **"distance_travelled"** to reflect the measurement more accurately. Its values carry a **" km"** suffix that must be stripped before numeric conversion.
7. The **"Doors"** column should contain **int64** values, but it currently holds strings such as **"04-May"**, requiring cleaning and data type conversion.
8. The **"Wheel"** column should be renamed to **"drive_type"** to better describe whether the car is a left-hand or right-hand drive.
9. `isnull()` reports **0 null values** in the dataset, although placeholder strings such as **"-"** in **"Levy"** are not counted as null.
10. There are **313 duplicate records** in the dataset.
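
Observation 2 calls for snake_case column names. A minimal sketch of one way to do this with a regular expression follows; `to_snake_case` is a hypothetical helper, and the specific renames listed above (for example `manufacturing_year`) would still need to be applied separately.

```python
import re

def to_snake_case(name: str) -> str:
    """Lowercase a column name and join its alphanumeric parts with underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

columns = ["ID", "Prod. year", "Leather interior", "Gear box type", "Airbags"]
print([to_snake_case(c) for c in columns])
```

Applied to the DataFrame, this could be written as `df.columns = [to_snake_case(c) for c in df.columns]`.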


## **Data Cleaning**

Based on the observations above, we will clean the data and prepare it for analysis.
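
As a rough sketch of what such cleaning could look like, the example below drops exact duplicates, parses `Levy` as a number, and strips the unit from `Mileage`. It runs on toy rows modelled on the preview (the last row is an invented duplicate), and it assumes that `-` marks a missing levy and that every mileage value ends in ` km`. Neither assumption has been verified against the full dataset.

```python
import pandas as pd

# Toy rows modelled on the preview; the last row duplicates the one before it.
raw = pd.DataFrame({
    "Levy": ["1399", "1018", "-", "862", "862"],
    "Mileage": ["186005 km", "192000 km", "200000 km", "168966 km", "168966 km"],
})

# Work on a copy so the raw frame stays unchanged.
clean = raw.drop_duplicates().copy()

# errors="coerce" turns unparseable entries such as "-" into NaN (assumed missing).
clean["Levy"] = pd.to_numeric(clean["Levy"], errors="coerce")

# Remove the unit suffix, then convert to integers.
clean["Mileage"] = clean["Mileage"].str.replace(" km", "", regex=False).astype("int64")

print(clean)
print(clean.dtypes)
```

Because `NaN` cannot be stored in a plain `int64` column, `Levy` ends up as `float64` here; a nullable integer dtype or an imputation step would be needed to keep it integral.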