# 01 – Data Exploration and Cleaning
### Car Price Prediction Using Machine Learning
Group Assignment 02 - CCS3012 - Data Analytics  
Submission Date: 16th September 2025

---

### **Group 11**
-  **FC211034 - N.D. Samararathne Kodikara**
-  **FC211013 - N.W.V. Tharindu Pabasara**
-  **FC211025 - W.M.M.C.B. Wijesundara**



---

### **Supervisor**
**Ms. Dilmi Praveena**  
*Faculty of Computing*  
*University of Sri Jayewardenepura*

---



## 📌 Objective
This notebook focuses on understanding the dataset, cleaning it, and preparing it for further analysis and modeling.

**Key Tasks:**
- Load dataset
- Inspect and document schema / anomalies
- Clean & convert data types
- Basic descriptive analysis & visualizations
- Save a cleaned dataset + metadata for modeling notebook


---

### 📂 Input  
- `car_price_prediction.csv` saved in `Data/raw/`  


### 📦 Output  
- `clean_data.csv` saved in `Data/processed/`

---

### 📊 Dataset Overview
**Dataset:** Car price dataset.  
**Columns:** ID, Price, Levy, Manufacturer, Model, Prod. year, Category, Leather interior, Fuel type, Engine volume, Mileage, Cylinders, Gear box type, Drive wheels, Doors, Wheel, Color, Airbags.


| **Attribute** | **Details** |
|---------------|-------------|
| **Dataset Size** | 19,237 records × 18 features |
| **Data Type** | Structured tabular data (CSV format) |
| **Target Variable** | `Price` (in USD $) |
| **Problem Type** | Regression |
| **Data Source** | [Car Price Prediction Dataset](https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge) |

## Setup & imports

In [43]:
# Import the libraries needed for data manipulation and visualization.

# Data Manipulation and Utilities
import pandas as pd     # For data manipulation and analysis.
import numpy as np      # For numerical operations.
import re               # For regular expressions.
import warnings         # For managing warnings.

# Data Visualization
import matplotlib.pyplot as plt     # For basic data visualization.
import seaborn as sns               # For statistical data visualization.

from prettytable import PrettyTable # For creating formatted tables in the console.

In [44]:
# Reusable function definitions

# Function to Print Shape of DataFrame
def get_data_shape(data: pd.DataFrame) -> None:
    if data.empty:
        print("DataFrame is empty.")
    else:
        # Print the shape of the DataFrame
        print("DataFrame Dimensions")
        print("------------------------")
        print(f"Rows   : {data.shape[0]}")
        print(f"Columns: {data.shape[1]}\n")

# Format a Series' value counts (with percentages) as a PrettyTable for more readable notebook output.
def value_counts_pretty(series, column_name="Value", n=None, head=True):
    value_counts = series.value_counts(dropna=False)
    percentages = series.value_counts(dropna=False, normalize=True) * 100
    
    # Apply head/tail filtering if n is specified
    if n is not None:
        if head:
            value_counts = value_counts.head(n)
            percentages = percentages.head(n)
        else:
            value_counts = value_counts.tail(n)
            percentages = percentages.tail(n)
    
    table = PrettyTable()        
    table.field_names = [column_name, "Count", "Percentage"]
    
    # Set alignment
    table.align[column_name] = "c"
    table.align["Count"] = "r"
    table.align["Percentage"] = "r"
    
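    # Add rows. Both Series come from value_counts() on the same data and share
    # the same order, so pair them positionally instead of looking up by label
    # (a label lookup can be unreliable when the value is NaN).
    for (value, count), pct in zip(value_counts.items(), percentages):
        display_value = "NaN/Missing" if pd.isna(value) else str(value)
        table.add_row([display_value, count, f"{pct:.2f}%"])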
    
    return table

In [45]:
# Next we load the dataset.
raw_df = pd.read_csv("./Data/raw/car_price_prediction.csv")


In [46]:
# Prints the first 5 rows of the DataFrame (Provides a quick look at the dataset's content).
raw_df.head()


Unnamed: 0,ID,Price,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags
0,45654403,13328,1399,LEXUS,RX 450,2010,Jeep,Yes,Hybrid,3.5,186005 km,6.0,Automatic,4x4,04-May,Left wheel,Silver,12
1,44731507,16621,1018,CHEVROLET,Equinox,2011,Jeep,No,Petrol,3.0,192000 km,6.0,Tiptronic,4x4,04-May,Left wheel,Black,8
2,45774419,8467,-,HONDA,FIT,2006,Hatchback,No,Petrol,1.3,200000 km,4.0,Variator,Front,04-May,Right-hand drive,Black,2
3,45769185,3607,862,FORD,Escape,2011,Jeep,Yes,Hybrid,2.5,168966 km,4.0,Automatic,4x4,04-May,Left wheel,White,0
4,45809263,11726,446,HONDA,FIT,2014,Hatchback,Yes,Petrol,1.3,91901 km,4.0,Automatic,Front,04-May,Left wheel,Silver,4


💡 **Observations:**  
- The column names are readable and meaningful, but some contain spaces and punctuation (e.g. `Prod. year`), so they need standardising.
- At first glance, the `Doors` column contains entries such as `04-May` (possibly a door-count range that was auto-converted to a date), which indicates a data quality issue.
- The `Levy` column uses a dash (`-`) as a placeholder for missing values.
- The `Mileage` column stores values with a unit suffix (e.g. `186005 km`), which must be stripped before the column can be treated as numeric.
- Overall, the dataset features a mix of numerical and categorical variables that will require cleaning before analysis.
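
The placeholder and unit issues noted above could be handled along the following lines. This is only a minimal sketch on a toy frame built from the values visible in the preview; the helper name `clean_placeholders` is hypothetical and not part of the notebook, and the real columns may contain other irregular values that this does not cover.

```python
import pandas as pd

# Toy rows copied from the preview above (not the full dataset)
sample = pd.DataFrame({
    "Levy": ["1399", "1018", "-", "862"],
    "Mileage": ["186005 km", "192000 km", "200000 km", "168966 km"],
})

def clean_placeholders(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: parse Levy as numeric and strip ' km' from Mileage."""
    out = frame.copy()
    # '-' cannot be parsed as a number, so errors='coerce' turns it into NaN
    out["Levy"] = pd.to_numeric(out["Levy"], errors="coerce")
    # Remove the literal unit suffix, then convert to integers
    out["Mileage"] = out["Mileage"].str.replace(" km", "", regex=False).astype("int64")
    return out

cleaned_sample = clean_placeholders(sample)
print(cleaned_sample.dtypes)
```

Coercing to `NaN` keeps the missing levies visible to later imputation steps rather than silently replacing them with a number.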



In [47]:
# Clean column names
# Strip leading/trailing spaces, lowercase, replace spaces, slashes and hyphens
# with underscores, and drop dots and question marks.
# Work on a copy so that raw_df stays untouched.
df = raw_df.copy()
df.columns = [col.strip().lower().replace(" ", "_").replace("/", "_").replace(".", "").replace("-", "_").replace("?", "").rstrip("_") for col in raw_df.columns]
print("\nColumn names:")
print(df.columns.tolist())


Column names:
['id', 'price', 'levy', 'manufacturer', 'model', 'prod_year', 'category', 'leather_interior', 'fuel_type', 'engine_volume', 'mileage', 'cylinders', 'gear_box_type', 'drive_wheels', 'doors', 'wheel', 'color', 'airbags']
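
As an alternative to the chained `.replace()` calls, the same normalisation could be written with the `re` module that is already imported. The function below is a hypothetical sketch, not the notebook's method, and it differs slightly from the original: it collapses runs of separators into one underscore and strips underscores from both ends.

```python
import re

def normalise_column_name(name: str) -> str:
    """Hypothetical helper: convert a raw header into snake_case."""
    name = name.strip().lower()
    name = re.sub(r"[.?]", "", name)        # drop dots and question marks
    name = re.sub(r"[\s/\-]+", "_", name)   # spaces, slashes, hyphens -> underscore
    return name.strip("_")

raw_headers = ["ID", "Prod. year", "Gear box type", "Leather interior"]
print([normalise_column_name(h) for h in raw_headers])
```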


In [None]:
# Find the dimensions of the dataset
get_data_shape(df)

DataFrame Dimensions
------------------------
Rows   : 19237
Columns: 18



📝 The dataset contains 19,237 rows and 18 columns.

In [49]:
# Prints the name of each column in the dataset, the number of non-null values it contains, and its data type.
def df_info(df):
    table = PrettyTable()
    table.field_names = ["Column", "Non-Null Count", "Dtype"]

    for col in df.columns:
        non_null_count = df[col].count()
        dtype = df[col].dtype
        table.add_row([col, non_null_count, dtype])

    print(table)

df_info(df)

+------------------+----------------+---------+
|      Column      | Non-Null Count |  Dtype  |
+------------------+----------------+---------+
|        id        |     19237      |  int64  |
|      price       |     19237      |  int64  |
|       levy       |     19237      |  object |
|   manufacturer   |     19237      |  object |
|      model       |     19237      |  object |
|    prod_year     |     19237      |  int64  |
|     category     |     19237      |  object |
| leather_interior |     19237      |  object |
|    fuel_type     |     19237      |  object |
|  engine_volume   |     19237      |  object |
|     mileage      |     19237      |  object |
|    cylinders     |     19237      | float64 |
|  gear_box_type   |     19237      |  object |
|   drive_wheels   |     19237      |  object |
|      doors       |     19237      |  object |
|      wheel       |     19237      |  object |
|      color       |     19237      |  object |
|     airbags      |     19237      |  int64  |
+------------------+----------------+---------+

💡 **Observations:**  
- All columns report 19,237 non-null values, so pandas detects no missing data. However, the preview above showed placeholder values (such as `-` in `levy`) that pandas counts as valid strings, so missing values still need to be investigated and cleaned.
- Several columns expected to be numeric (`levy`, `mileage`, and `engine_volume`) are currently of type `object`, which means they contain non-numeric characters or inconsistent formatting.
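
One way to quantify how many entries block a numeric conversion is to attempt the conversion and count the failures. The sketch below uses a toy frame with made-up values (including the `2.0 Turbo` entry, which is invented for illustration), and `count_non_numeric` is a hypothetical helper name.

```python
import pandas as pd

# Toy data with invented values; '-' stands in for the placeholder seen in `levy`
toy = pd.DataFrame({
    "levy": ["1399", "-", "862", "-"],
    "engine_volume": ["3.5", "2.0 Turbo", "1.3", "2.5"],
})

def count_non_numeric(series: pd.Series) -> int:
    """Hypothetical helper: non-null entries that fail numeric parsing."""
    parsed = pd.to_numeric(series, errors="coerce")
    # An entry is a parsing failure if it was present but became NaN
    return int((parsed.isna() & series.notna()).sum())

for col in toy.columns:
    print(f"{col}: {count_non_numeric(toy[col])} non-numeric entries")
```

Applied to the real `df`, the same loop over the `object` columns would show which ones hold only a few stray values and which are genuinely categorical.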

In [50]:
# Prints basic statistics (mean, standard deviation, min, max, etc.)
df.describe(include=[np.number])      # Numeric columns only


Unnamed: 0,id,price,prod_year,cylinders,airbags
count,19237.0,19237.0,19237.0,19237.0,19237.0
mean,45576540.0,18555.93,2010.912824,4.582991,6.582627
std,936591.4,190581.3,5.668673,1.199933,4.320168
min,20746880.0,1.0,1939.0,1.0,0.0
25%,45698370.0,5331.0,2009.0,4.0,4.0
50%,45772310.0,13172.0,2012.0,4.0,6.0
75%,45802040.0,22075.0,2015.0,4.0,12.0
max,45816650.0,26307500.0,2020.0,16.0,16.0


💡 **Observations:**  
- `price` has a minimum of $1 and a maximum of $26,307,500, which is more than 1,400 times the mean ($18,555.93). This points to extreme outliers at both ends.
- `prod_year` ranges from 1939 to 2020, so it likely contains vintage cars or data-entry errors.
- `cylinders` has a maximum of 16, which may correspond to rare or high-performance vehicles.
- Some vehicles appear to have 0 `airbags`, which might indicate older models or missing or misreported values.
- `id` is a unique identifier for each row and does not carry predictive value.
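
To put a number on the `price` outliers, one common heuristic is the 1.5 × IQR rule. It is used here only as an assumption for illustration; the notebook does not state which outlier rule it adopts. The sketch plugs in the quartiles from the `describe()` output above, and `iqr_fences` is a hypothetical helper.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Hypothetical helper: lower and upper outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 25% and 75% quantiles of `price`, copied from the describe() output
lower_fence, upper_fence = iqr_fences(5331.0, 22075.0)
print(f"Lower fence: {lower_fence:,.0f}")
print(f"Upper fence: {upper_fence:,.0f}")
```

With these quartiles the upper fence is 47,191, far below the maximum of 26,307,500, while the lower fence is negative and therefore flags nothing. The $1 prices would need a separate, domain-based threshold.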


In [53]:
# Count the rows that hold these extreme values
# Production year:
production_year_40 = df[df['prod_year'] < 1980].shape        # Cars built before 1980 (40+ years older than the newest model year, 2020)

count_40 = production_year_40[0]
percentage_40 = (count_40 / df.shape[0]) * 100

print(f"🎯 Matches found: {count_40} ({percentage_40:.4f}%)")

🎯 Matches found: 23 (0.1196%)


> 💡 Only 23 rows (~0.12%) have a production year before 1980.  
> *🧠 These are likely vintage cars or incorrectly entered records. Because they are so few, they can be removed to prevent skewing the analysis.*

In [57]:
# Cylinders: rows with more than 12 cylinders
cylinders_over_12 = df[df['cylinders'] > 12].shape

count_over_12 = cylinders_over_12[0]
percentage_over_12 = (count_over_12 / df.shape[0]) * 100

print(f"🎯 Matches found: {count_over_12} ({percentage_over_12:.4f}%)")


🎯 Matches found: 6 (0.0312%)


> 💡 Only 6 rows (~0.03%) have more than 12 cylinders; these likely represent performance vehicles.  
> *🧠 Since the model focuses on the general car market, they should be removed to avoid skewing the analysis.*
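
The two removals proposed so far (pre-1980 cars and engines with more than 12 cylinders) could be applied with a single boolean mask. The following is a minimal sketch on invented toy rows; `drop_extremes` and its default thresholds are assumptions that mirror the cut-offs used in the cells above.

```python
import pandas as pd

# Invented toy rows, not taken from the dataset
toy = pd.DataFrame({
    "prod_year": [1939, 1975, 1999, 2012, 2015],
    "cylinders": [4.0, 8.0, 16.0, 4.0, 6.0],
})

def drop_extremes(frame: pd.DataFrame, min_year: int = 1980,
                  max_cylinders: int = 12) -> pd.DataFrame:
    """Hypothetical helper: keep rows within the year and cylinder limits."""
    keep = (frame["prod_year"] >= min_year) & (frame["cylinders"] <= max_cylinders)
    return frame.loc[keep].reset_index(drop=True)

trimmed = drop_extremes(toy)
print(f"Rows before: {len(toy)}, after: {len(trimmed)}")
```

Logging the row count before and after makes it easy to confirm that only the expected handful of records is dropped.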

In [59]:

# Airbags:
airbags_0 = df[df['airbags'] == 0].shape

count_0 = airbags_0[0]
percentage_0 = (count_0 / df.shape[0]) * 100

print(f"🎯 Matches found: {count_0} ({percentage_0:.4f}%)")

🎯 Matches found: 2405 (12.5019%)


> 💡 2,405 rows (~12.50%) have 0 airbags.  
> *🧠 This is a sizable portion of the data. Further inspection is needed before deciding how to handle them.*
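
One possible way to start that inspection is to check whether zero-airbag rows cluster in older production years. The sketch below groups an invented toy frame by decade; `zero_airbag_share_by_decade` is a hypothetical helper, and the grouping by decade is an assumption rather than the notebook's chosen approach.

```python
import pandas as pd

# Invented toy rows, not taken from the dataset
toy = pd.DataFrame({
    "prod_year": [1995, 1998, 2005, 2012, 2014, 2016],
    "airbags":   [0,    0,    2,    0,    8,    12],
})

def zero_airbag_share_by_decade(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: percentage of rows with 0 airbags per decade."""
    decade = (frame["prod_year"] // 10) * 10
    is_zero = frame["airbags"].eq(0)
    # Mean of a boolean Series is the share of True values
    return is_zero.groupby(decade).mean().mul(100).round(2)

share = zero_airbag_share_by_decade(toy)
print(share)
```

If the share stays high even for recent decades, the zeros are more likely to be missing or misreported values than genuine features of older cars.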