# 02 – Preprocessing & Feature Engineering
### Car Price Prediction Using Machine Learning
Group Assignment 02 - CCS3012 - Data Analytics  
Submission Date: 16th September 2025

---

### **Group 11**
-  **FC211034 - N.D. Samararathne Kodikara**
-  **FC211013 - N.W.V. Tharindu Pabasara**
-  **FC211025 - W.M.M.C.B. Wijesundara**



---

### **Supervisor**
**Ms. Dilmi Praveena**  
*Faculty of Computing*  
*University of Sri Jayewardenepura*

---



## 📌 Objectives

This notebook builds upon the cleaned dataset produced in **Notebook 01 — Data Exploration & Cleaning**.  
The focus here is on understanding the dataset more deeply, validating insights statistically, and preparing features for modeling.  

---

###  1. Descriptive Analytics  
- Summarize numerical variables.  
- Explore categorical variables.  
- Visualize distributions.  
- Examine bivariate relationships.  

###  2. Inferential Analytics  
- Perform hypothesis testing to assess whether differences between groups are statistically significant.  
- Check correlation strength and direction.  
- Identify potential multicollinearity issues between predictors.  
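
The three checks above can be sketched on a small synthetic frame. This is a minimal illustration, not the analysis performed in this notebook: the `toy` data is randomly generated, only the column names mirror the dataset, and Welch's t-test and Pearson's r are assumed choices of test.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: column names mirror the dataset, values are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "leather_interior": ["yes"] * 50 + ["no"] * 50,
    "engine_volume": rng.normal(2.0, 0.5, 100),
})
toy["cylinders"] = (toy["engine_volume"] * 2).round() + rng.integers(0, 2, 100)
toy["price"] = 5000 * toy["engine_volume"] + rng.normal(0, 1000, 100)

# 1) Hypothesis test: do the two groups differ in mean price?
#    Welch's t-test does not assume equal variances.
yes_prices = toy.loc[toy["leather_interior"] == "yes", "price"]
no_prices = toy.loc[toy["leather_interior"] == "no", "price"]
t_stat, p_value = stats.ttest_ind(yes_prices, no_prices, equal_var=False)

# 2) Correlation strength and direction between a predictor and the target.
r, r_p_value = stats.pearsonr(toy["engine_volume"], toy["price"])

# 3) Multicollinearity screen: correlation between the predictors themselves.
predictor_corr = toy[["engine_volume", "cylinders"]].corr()

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print(f"Pearson r = {r:.3f}")
print(predictor_corr)
```

A high absolute correlation between two predictors is only a first screen; a variance inflation factor would be a natural follow-up if one is needed.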

###  3. Feature Engineering  
- Create new features.  
- Extract useful info from text-based or categorical columns.  
- Handle skewed features.  
- Generate interaction terms if useful.  
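
As a rough sketch of what these steps could look like, the snippet below derives three features from the first three rows that `df.head()` prints in this notebook. The feature names `car_age`, `log_mileage`, and `volume_per_cylinder`, and the reference year, are assumptions made for illustration, not the features this notebook necessarily creates.

```python
import numpy as np
import pandas as pd

# First three rows of the df.head() output in this notebook (subset of columns).
sample = pd.DataFrame({
    "prod_year": [2010, 2011, 2006],
    "mileage": [186005, 192000, 200000],
    "engine_volume": [3.5, 3.0, 1.3],
    "cylinders": [6, 6, 4],
})

REFERENCE_YEAR = 2025  # assumption: the year of the submission date

# New feature: vehicle age in years.
sample["car_age"] = REFERENCE_YEAR - sample["prod_year"]

# Skew handling: log1p compresses large values and is defined at zero.
sample["log_mileage"] = np.log1p(sample["mileage"])

# Combined feature (hypothetical): engine volume per cylinder.
sample["volume_per_cylinder"] = sample["engine_volume"] / sample["cylinders"]

print(sample)
```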

###  4. Preprocessing Setup  
- Encode categorical variables.  
- Normalize/scale numerical features. 
- Standardize target variable if needed.  
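
A minimal pandas-only sketch of encoding and scaling follows, using values from the first three rows of `df.head()` in this notebook. One-hot encoding and z-score scaling are assumed here as examples; the notebook's actual pipeline may use different encoders or a library such as scikit-learn.

```python
import pandas as pd

# Values taken from the first three rows of df.head() in this notebook.
sample = pd.DataFrame({
    "fuel_type": ["hybrid", "petrol", "petrol"],
    "gear_box_type": ["automatic", "tiptronic", "variator"],
    "mileage": [186005, 192000, 200000],
})

# One-hot encode the categorical columns (one indicator column per category).
encoded = pd.get_dummies(sample, columns=["fuel_type", "gear_box_type"])

# Z-score scaling: subtract the mean, divide by the population standard deviation.
mileage = encoded["mileage"]
encoded["mileage_scaled"] = (mileage - mileage.mean()) / mileage.std(ddof=0)

print(encoded.columns.tolist())
```

In a real workflow the mean and standard deviation should be computed on the training set only and then reused on the test set, so that no information from the test data leaks into preprocessing.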

###  5. Train/Test Split  
- Split dataset into **training** and **testing** sets for unbiased model evaluation.  
- Save processed datasets and transformation pipeline for **Notebook 03 — Modeling**.  
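
The split itself can be sketched with pandas alone. The 80/20 ratio, the seed, and the `toy` frame below are assumptions for illustration; the notebook may use a different ratio or a library helper.

```python
import pandas as pd

# Hypothetical toy frame with ten rows; `price` is the target.
toy = pd.DataFrame({"id": range(10), "price": range(1000, 11000, 1000)})

TEST_FRACTION = 0.2  # assumption: 80/20 split
RANDOM_STATE = 42    # fixed seed so the split is reproducible

# Draw the test rows at random, then keep every remaining row for training.
test = toy.sample(frac=TEST_FRACTION, random_state=RANDOM_STATE)
train = toy.drop(test.index)

# Separate predictors from the target.
X_train, y_train = train.drop(columns="price"), train["price"]
X_test, y_test = test.drop(columns="price"), test["price"]

print(len(train), len(test))
```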



> By the end of this notebook, we will have a **fully processed dataset** with engineered and encoded features, ready for predictive modeling.  

---


### 📂 Input  
- `clean_data.csv` saved in `Data/processed/`  


### 📦 Output  


---

### 📊 Dataset Overview
**Dataset:** Cleaned car price dataset.  
**Columns include:**  
- `id`, `price` (target variable), `levy`, `manufacturer`, `model`, `prod_year`, `category`, `leather_interior`, `fuel_type`, `engine_volume`, `mileage`, `cylinders`, `gear_box_type`, `drive_wheels`, `doors`, `wheel`, `color`, `airbags`, `turbo` (derived in Notebook 01)


> **Dataset stats:** 15,697 rows × 19 columns (after cleaning) | Target variable: `price` | Problem type: Regression


## Setup & imports

In [6]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Statistical functions
from scipy import stats

from prettytable import PrettyTable # For creating formatted tables in the console.
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)


In [7]:
# Next we load the cleaned dataset.
df = pd.read_csv("./Data/processed/clean_data.csv")

In [8]:
# Prints the first 5 rows of the DataFrame.
df.head()

Unnamed: 0,id,price,levy,manufacturer,model,prod_year,category,leather_interior,fuel_type,engine_volume,mileage,cylinders,gear_box_type,drive_wheels,doors,wheel,color,airbags,turbo
0,45654403,13328,1399.0,lexus,rx 450,2010,jeep,yes,hybrid,3.5,186005,6,automatic,4x4,04-may,left wheel,silver,12,False
1,44731507,16621,1018.0,chevrolet,equinox,2011,jeep,no,petrol,3.0,192000,6,tiptronic,4x4,04-may,left wheel,black,8,False
2,45774419,8467,0.0,honda,fit,2006,hatchback,no,petrol,1.3,200000,4,variator,front,04-may,right-hand drive,black,2,False
3,45769185,3607,862.0,ford,escape,2011,jeep,yes,hybrid,2.5,168966,4,automatic,4x4,04-may,left wheel,white,0,False
4,45809263,11726,446.0,honda,fit,2014,hatchback,yes,petrol,1.3,91901,4,automatic,front,04-may,left wheel,silver,4,False


In [9]:
# Prints the name of each column in the dataset, the number of non-null values it contains, and its data type.
def df_info(df):
    table = PrettyTable()
    table.field_names = ["Column", "Non-Null Count", "Dtype"]

    for col in df.columns:
        non_null_count = df[col].count()
        dtype = df[col].dtype
        table.add_row([col, non_null_count, dtype])

    print(table)

df_info(df)

+------------------+----------------+---------+
|      Column      | Non-Null Count |  Dtype  |
+------------------+----------------+---------+
|        id        |     15697      |  int64  |
|      price       |     15697      |  int64  |
|       levy       |     15697      | float64 |
|   manufacturer   |     15697      |  object |
|      model       |     15697      |  object |
|    prod_year     |     15697      |  int64  |
|     category     |     15697      |  object |
| leather_interior |     15697      |  object |
|    fuel_type     |     15697      |  object |
|  engine_volume   |     15697      | float64 |
|     mileage      |     15697      |  int64  |
|    cylinders     |     15697      |  int64  |
|  gear_box_type   |     15697      |  object |
|   drive_wheels   |     15697      |  object |
|      doors       |     15697      |  object |
|      wheel       |     15697      |  object |
|      color       |     15697      |  object |
|     airbags      |     15697      |  i

💡 **Observations:**  
- The dataset appears unchanged and matches the previously saved version.

Let's start...

# Section 1: Descriptive Analytics

## Numerical Features
**Features to analyze: price, levy, mileage, engine_volume, cylinders, airbags, prod_year**

### Analyze `price`

In [10]:
# Summary statistics
price_stats = df['price'].describe()
print("Price Summary:\n", price_stats)

Price Summary:
 count    1.569700e+04
mean     2.028445e+04
std      2.108237e+05
min      1.000000e+00
25%      7.527000e+03
50%      1.426900e+04
75%      2.338900e+04
max      2.630750e+07
Name: price, dtype: float64
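
The summary above already hints at a heavy right tail: the mean exceeds the median, the standard deviation is roughly ten times the mean, and the maximum is far above the third quartile. As a rough worked example, the snippet below applies the common 1.5 × IQR rule to the printed values. The rule is an assumed choice for illustration, and the inputs are the rounded figures from the output above.

```python
# Values copied from the printed price summary (rounded as displayed).
q1, median, q3 = 7527.0, 14269.0, 23389.0
mean, maximum = 20284.45, 26307500.0

# Interquartile range and the conventional 1.5 * IQR fences.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

print(f"IQR         = {iqr:,.0f}")          # 15,862
print(f"Upper fence = {upper_fence:,.0f}")  # 47,182
print(f"Lower fence = {lower_fence:,.0f}")  # -16,266
print("Mean above median:", mean > median)  # True
print("Max above upper fence:", maximum > upper_fence)  # True
```

Because the lower fence is negative, this rule would flag only high prices. Implausibly low values such as the minimum of 1 would need a separate, domain-based check.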
