# :oncoming_automobile: Car Price Analytics
## 02 Cleaning

|Field|Description|
|-----|-----------|
|**Author:** |Robert Steven Elliott|
|**Course:** |Code Institute – Data Analytics with AI Bootcamp|
|**Project Type:** |Hackathon 1|
|**Team Name:** | tbc |
|**Date:** | November 2025| 

---
### Objectives
- Clean the raw car prices dataset.
- Handle missing values and remove duplicates.
- Encode categorical variables for analysis.
- Prepare the dataset for feature engineering and visualisation.
    - Extract manufacturer from CarName
    - Drop any columns which are not required (car_ID, CarName)
    - Replace all fueltype values equal to gas with petrol

### Inputs
- data/raw/car_prices.csv

### Outputs
- data/processed/car_prices.csv

### Additional Comments
Ensure that Notebook 01 was executed successfully before running this notebook. All cleaning steps must be reproducible and clearly documented.

---

## Ignore Warnings

This project uses a lot of older library versions, so we will turn off future warnings.

In [1]:
import warnings # Import the warnings module to manage warning messages
warnings.simplefilter(action='ignore', category=FutureWarning) # Ignore FutureWarning messages
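
The filter above applies to the whole session. As a minimal sketch of a narrower alternative, the suppression can be limited to a single block with `warnings.catch_warnings()`. The function `noisy_legacy_call` is a hypothetical stand-in for any library call that emits a `FutureWarning`; it is not part of this project.

```python
import warnings


def noisy_legacy_call():
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("this behaviour will change", FutureWarning)
    return 42


# Suppress FutureWarning only inside this block; the previous filters
# are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_legacy_call()
```

Scoping the filter this way keeps unrelated warnings visible elsewhere in the notebook, at the cost of wrapping each noisy call.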

---

## Set Up Environment and Change Working Directory


In [2]:

### This script sets the working directory to the project root

import os # Import the os module to interact with the operating system
PROJECT_ROOT = os.path.join(os.getcwd(), "..") # Adjust as necessary to point to your project root
os.chdir(PROJECT_ROOT) # Change working directory to project root
print("✅ Working directory set to project root:", os.getcwd()) # Confirmation message

✅ Working directory set to project root: /home/robert/Projects/Car-Price-Analytics


Confirm the new current directory

In [3]:
current_dir = os.getcwd() # Get the current working directory
current_dir # Display the current working directory

'/home/robert/Projects/Car-Price-Analytics'
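
One caveat with joining `".."` onto the current directory is that re-running the cell moves up one more level each time. The sketch below shows one possible way to make the step repeatable: walk upwards until a directory containing a marker folder is found. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Possible usage in the notebook (left commented out to avoid side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the search starts at the current directory itself, running it a second time from the project root returns the same path.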

---

## Import Dependencies

In [4]:
import pandas as pd # Import pandas for data manipulation
import numpy as np # Import numpy for numerical operations

---

## Load Dataset


In [5]:
data_path = "data/raw/car_prices.csv" # Define the path to the dataset

try:
    df = pd.read_csv(data_path) # Load the dataset
    print(f"✅ Dataset loaded. Shape: {df.shape}") # Confirmation message with dataset shape
except FileNotFoundError: # Handle the case where the file is not found
    raise FileNotFoundError("❌ car_prices.csv not found. Place it in data/raw/")

✅ Dataset loaded. Shape: (205, 26)


---

## Handle Missing Values


In [6]:
print("\n🔢 Missing Values:")
display(df.isna().sum()) # Display the count of missing values per column


🔢 Missing Values:


car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

There are no missing values.
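
If the raw file changes in future, a visual check of the table is easy to miss. As a rough sketch, the same count can be reduced to only the affected columns so that it can back an automated check. The helper `columns_with_missing` and the two-row `sample` frame are illustrative assumptions, not data from this project.

```python
import pandas as pd


def columns_with_missing(frame: pd.DataFrame) -> dict:
    """Return {column: count} for columns holding at least one missing value."""
    counts = frame.isna().sum()
    return {column: int(count) for column, count in counts.items() if count > 0}


# Made-up sample with one missing price, purely for illustration.
sample = pd.DataFrame({"price": [13495.0, None], "fueltype": ["gas", "diesel"]})
report = columns_with_missing(sample)
```

On a frame with no missing values, such as the one loaded above, the helper would return an empty dictionary.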

---

## Remove Duplicates


In [7]:
before = df.shape[0] # Store the number of rows before removing duplicates
df.drop_duplicates(inplace=True) # Remove duplicate rows
after = df.shape[0] # Store the number of rows after removing duplicates
print(f"✅ Removed {before - after} duplicate rows.") # Confirmation message with number of duplicates removed

✅ Removed 0 duplicate rows.


There were no duplicate entries.
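
Note that the check above compares whole rows, including car_ID. If car_ID is a unique row identifier (an assumption; the column name suggests it), two otherwise identical cars would never be flagged. The sketch below, using a made-up three-row frame, shows how the comparison could exclude the identifier.

```python
import pandas as pd

# Tiny illustrative frame (made-up values, not rows from the dataset).
sample = pd.DataFrame({
    "car_ID": [1, 2, 3],
    "fueltype": ["gas", "gas", "diesel"],
    "price": [13495.0, 13495.0, 16500.0],
})

# Including car_ID, every row is unique, so nothing is flagged.
full_row_duplicates = int(sample.duplicated().sum())

# Excluding car_ID reveals rows that are identical in every other column.
feature_columns = sample.columns.difference(["car_ID"]).tolist()
feature_duplicates = int(sample.duplicated(subset=feature_columns).sum())
```

Whether such rows should be dropped is a judgment call: two different cars can legitimately share the same specification and price.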

---

## Handle Typos

First, we create the manufacturer column from CarName and drop CarName.

In [8]:
df["manufacturer"] = df["CarName"].apply(lambda x: x.split(" ")[0]) # Extract manufacturer from CarName
df.drop("CarName", axis=1, inplace=True) # Drop the original CarName column
print("✅ Encoded 'manufacturer' from 'CarName' and dropped 'CarName' column.") # Confirmation message

✅ Encoded 'manufacturer' from 'CarName' and dropped 'CarName' column.
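
The extraction could also be written with pandas' vectorised string methods instead of `apply`. This is only a sketch of an alternative, not the method used in this notebook; the three names in the example are illustrative values chosen to show stray whitespace and mixed case.

```python
import pandas as pd

# Illustrative values only: one clean name, one capitalised, one with a leading space.
names = pd.Series(["alfa-romero giulia", "Nissan clipper", " vw rabbit"])

manufacturer = (
    names.str.strip()            # Remove leading/trailing whitespace first
    .str.split(" ", n=1)         # Split once, on the first space only
    .str[0]                      # Keep the first token (the manufacturer)
    .str.lower()                 # Normalise case
)
```

Stripping before splitting matters: without it, a leading space would make the first token an empty string.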


In [10]:
df['manufacturer'] = df['manufacturer'].str.lower() # Convert manufacturer names to lowercase for consistency
for manufacturer in df["manufacturer"].unique(): # Iterate over unique manufacturers
    print(f" - {manufacturer}") # Display unique manufacturers

 - alfa-romero
 - audi
 - bmw
 - chevrolet
 - dodge
 - honda
 - isuzu
 - jaguar
 - maxda
 - mazda
 - buick
 - mercury
 - mitsubishi
 - nissan
 - peugeot
 - plymouth
 - porsche
 - porcshce
 - renault
 - saab
 - subaru
 - toyota
 - toyouta
 - vokswagen
 - volkswagen
 - vw
 - volvo


From the list above we can see a number of typos in the manufacturer names. ('Nissan' is technically a capitalisation inconsistency rather than a typo, and the lowercase conversion above has already normalised it.) These will now be corrected.

In [11]:
df['manufacturer'] = df['manufacturer'].replace({
    'maxda': 'mazda',
    'porcshce': 'porsche',
    'toyouta': 'toyota',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
    'Nissan': 'nissan', # No-op here: names were already lowercased in the previous cell
}) # Correct typos in manufacturer names
print("✅ Corrected typos in 'manufacturer' names.") # Confirmation message
for manufacturer in df["manufacturer"].unique(): # Iterate over unique manufacturers
    print(f" - {manufacturer}") # Display unique manufacturers

✅ Corrected typos in 'manufacturer' names.
 - alfa-romero
 - audi
 - bmw
 - chevrolet
 - dodge
 - honda
 - isuzu
 - jaguar
 - mazda
 - buick
 - mercury
 - mitsubishi
 - nissan
 - peugeot
 - plymouth
 - porsche
 - renault
 - saab
 - subaru
 - toyota
 - volkswagen
 - volvo


We now have the correct list of manufacturers.
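
The typos above were spotted by eye. As a hedged sketch of how candidates might be flagged automatically, the standard library's `difflib.get_close_matches` can compare each observed name against a list of known-good names. The `canonical` list and the cutoff of 0.7 are assumptions chosen for illustration; the `observed` names are taken from the output above.

```python
from difflib import get_close_matches

# Assumed list of known-good names (hand-written for this example).
canonical = ["mazda", "porsche", "toyota", "volkswagen", "nissan"]

# A subset of the names printed before correction.
observed = ["maxda", "mazda", "porcshce", "toyouta", "vokswagen", "vw", "nissan"]

suggestions = {}
for name in observed:
    if name in canonical:
        continue  # Already a known-good name
    match = get_close_matches(name, canonical, n=1, cutoff=0.7)
    suggestions[name] = match[0] if match else None  # None means no close match
```

Misspellings are matched to their intended name, while an abbreviation such as `vw` is too dissimilar to `volkswagen` and comes back as `None`, so it would still need a manual mapping like the one used above.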