<div align="center" style="font-family: 'Consolas', monospace;"><h1> Data Exploration For Car Price Predictor </h1> </div>

<p align = "center" style="font-family: 'Consolas', monospace;"> The purpose of this notebook is to perform an initial exploratory data analysis (EDA) on the dataset for car price prediction. I will clean, visualize, and understand the data before building a predictive model.</p>

<br><ul> <li style="font-family: 'Consolas', monospace;">Importing Necessary Libraries</li></ul>

In [69]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<ul><li style="font-family: 'Consolas', monospace;"> Loading The Dataset </li></ul>

In [None]:
df = pd.read_csv('Data/Car_Price_Data.csv')
print(df.head())

<ul><li style="font-family: 'Consolas', monospace;"> Basic Information About Dataset </li></ul>

In [None]:
print(f"Dataset Shape: {df.shape}")
df.info()
df.describe()

<ul><li style="font-family: 'Consolas', monospace;"> Check for missing values </li></ul>

In [None]:
print("Missing values:", df.isnull().sum().sum())

<p style="font-family: 'Consolas', monospace;"> No missing values found</p>
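<p style="font-family: 'Consolas', monospace;"> The single total above hides where any gaps would sit. A minimal sketch of a per-column breakdown follows; the helper name <code>missing_report</code> and the toy frame are assumptions made for illustration and are not part of the original notebook.</p>

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)


# Toy frame for illustration only; in the notebook, pass `df` instead.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", None, "gas"],
    "horsepower": [111.0, np.nan, 154.0, np.nan],
    "price": [13495, 16500, 16500, 13950],
})

report = missing_report(toy)
print(report)

# Fully duplicated rows are a related data-quality check
print("Duplicate rows:", toy.duplicated().sum())
```

<p style="font-family: 'Consolas', monospace;"> Calling <code>missing_report(df)</code> on the real dataset should list every column with a count of zero if the "no missing values" finding holds.</p>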
<ul><li style="font-family: 'Consolas', monospace;"> Data Cleaning </li></ul>

In [73]:
# String standardization: lowercase and trim every text column
text_columns = ["CarName", "fueltype", "aspiration", "doornumber", "carbody",
                "drivewheel", "enginelocation", "fuelsystem", "enginetype",
                "cylindernumber"]
for col in text_columns:
    df[col] = df[col].str.lower().str.strip()


<ul><li style="font-family: 'Consolas', monospace;"> Data Distribution </li></ul>

In [None]:
df.hist(figsize=(24, 16), bins=30)
plt.show()

In [None]:
# Create a 4x3 grid of subplots
fig, axes = plt.subplots(4, 3, figsize=(24, 16))

# List of categorical columns
categorical_columns = ["fueltype", "carbody", "doornumber", "drivewheel",
                       "enginelocation", "enginetype", "fuelsystem", "aspiration", "CarName",
                       "cylindernumber"]

# Flatten the 2D axes array to easily loop through it
axes = axes.flatten()

# Plot countplots for each categorical feature
for i, col in enumerate(categorical_columns):
    sns.countplot(y=df[col], ax=axes[i])
    axes[i].set_title(f"{col} Distribution")
    axes[i].tick_params(axis='y', labelsize=12)  # Adjust label size for readability

# Hide any unused subplots, then adjust layout
for ax in axes[len(categorical_columns):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()


<ul><li style="font-family: 'Consolas', monospace;"> Dropping Uninformative Columns </li></ul>
<p style="font-family: 'Consolas', monospace;">car_ID: a row identifier, so it adds no learning value to the data</p>
<p style="font-family: 'Consolas', monospace;">CarName: car names also add no learning value to the set</p>
<p style="font-family: 'Consolas', monospace;">enginelocation: extremely low variance, so it adds no learning value</p>
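<p style="font-family: 'Consolas', monospace;"> One way to back up a "low variance" claim for a categorical column is to measure the share of rows taken by its most frequent value. The sketch below assumes a hypothetical helper <code>dominant_share</code>, a toy frame, and an illustrative cutoff of 0.85; none of these come from the original analysis.</p>

```python
import pandas as pd


def dominant_share(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Share of rows taken by the most frequent value in each column."""
    shares = {
        # value_counts sorts descending, so the first entry is the dominant value
        col: frame[col].value_counts(normalize=True).iloc[0]
        for col in columns
    }
    return pd.Series(shares).sort_values(ascending=False)


# Toy frame for illustration only; in the notebook, pass `df` instead.
toy = pd.DataFrame({
    "enginelocation": ["front"] * 9 + ["rear"],
    "fueltype": ["gas"] * 6 + ["diesel"] * 4,
})

shares = dominant_share(toy, ["enginelocation", "fueltype"])
print(shares)

threshold = 0.85  # illustrative cutoff, not a value from the notebook
near_constant = shares[shares > threshold].index.tolist()
print("Near-constant columns:", near_constant)
```

<p style="font-family: 'Consolas', monospace;"> A column whose dominant value covers almost every row gives a model very little to split on, which is the reasoning behind dropping enginelocation.</p>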

In [76]:
df.drop(columns=["car_ID", "CarName" , "enginelocation"], inplace=True)


<ul><li style="font-family: 'Consolas', monospace;"> Save Cleaned Dataset </li></ul>

In [None]:
df.to_csv("Data/Cleaned_Car_Price_Data.csv", index=False)
print("Cleaned dataset saved.")
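
<p style="font-family: 'Consolas', monospace;"> A quick round-trip check can confirm that a saved file reloads with the expected shape and columns. This is a minimal sketch that writes a toy frame to a temporary directory; the frame and the file name <code>cleaned_toy.csv</code> are assumptions for illustration only.</p>

```python
import os
import tempfile

import pandas as pd

# Toy frame for illustration only; in the notebook, use `df` and the real path.
toy = pd.DataFrame({"fueltype": ["gas", "diesel"], "price": [13495.0, 16500.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_toy.csv")
    toy.to_csv(path, index=False)   # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```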