## Dataset links

https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data

https://www.kaggle.com/datasets/sukritchatterjee/used-cars-dataset-cardekho

In [1]:
# Step 1 - Load dataset

import pandas as pd

# Load the vehicles dataset
df = pd.read_csv(
    "dataset/vehicles.csv",
    low_memory=False,      # read in one pass to avoid mixed-dtype columns
    on_bad_lines="skip",   # drop malformed rows instead of raising an error
)

# Basic info
print("Shape:", df.shape)
print("\nColumns (first 25):", df.columns.tolist()[:25])
print("\nMemory usage (MB):", round(df.memory_usage(deep=True).sum() / 1024**2, 2))

# Preview first rows
df.head(5)


Shape: (426880, 26)

Columns (first 25): ['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url', 'description', 'county', 'state', 'lat', 'long']

Memory usage (MB): 4310.31


|   | id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | ... | size | type | paint_color | image_url | description | county | state | lat | long | posting_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7222695916 | https://prescott.craigslist.org/cto/d/prescott... | prescott | https://prescott.craigslist.org | 6000 |  |  |  |  |  | ... |  |  |  |  |  |  | az |  |  |  |
| 1 | 7218891961 | https://fayar.craigslist.org/ctd/d/bentonville... | fayetteville | https://fayar.craigslist.org | 11900 |  |  |  |  |  | ... |  |  |  |  |  |  | ar |  |  |  |
| 2 | 7221797935 | https://keys.craigslist.org/cto/d/summerland-k... | florida keys | https://keys.craigslist.org | 21000 |  |  |  |  |  | ... |  |  |  |  |  |  | fl |  |  |  |
| 3 | 7222270760 | https://worcester.craigslist.org/cto/d/west-br... | worcester / central MA | https://worcester.craigslist.org | 1500 |  |  |  |  |  | ... |  |  |  |  |  |  | ma |  |  |  |
| 4 | 7210384030 | https://greensboro.craigslist.org/cto/d/trinit... | greensboro | https://greensboro.craigslist.org | 4900 |  |  |  |  |  | ... |  |  |  |  |  |  | nc |  |  |  |
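
The roughly 4.3 GB footprint reported above comes almost entirely from the string (`object`) columns, since the handful of numeric columns only need a few tens of MB at this row count. A minimal sketch of one way to shrink the frame, assuming the free-text columns are not needed downstream and that columns with few distinct values can be stored as `category`. The helper name `shrink_frame`, the `max_unique_ratio` threshold, and the demo data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def shrink_frame(df: pd.DataFrame, drop_cols=(), max_unique_ratio=0.05) -> pd.DataFrame:
    """Drop unwanted columns and store low-cardinality string columns as category."""
    # drop() returns a new frame, so the input is left untouched
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    for col in out.columns:
        s = out[col]
        is_text = pd.api.types.is_object_dtype(s) or isinstance(s.dtype, pd.StringDtype)
        # Convert only when distinct values are few relative to the row count
        if is_text and s.nunique(dropna=True) <= max_unique_ratio * len(out):
            out[col] = s.astype("category")
    return out

# Tiny stand-in for the real frame (assumption: made-up values)
demo = pd.DataFrame({
    "state": ["az", "ar", "fl", "az"] * 50,
    "fuel": ["gas", "diesel", "gas", "gas"] * 50,
    "description": [f"listing text {i}" for i in range(200)],
})

small = shrink_frame(demo, drop_cols=["description"])
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"Memory: {before} -> {after} bytes")
```

On the real data the same call might look like `shrink_frame(df, drop_cols=["description", "url", "image_url", "region_url"])`, if those columns turn out not to be needed.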


In [2]:
# Step 2 - Initial Data Quality Analysis

# 1. Data types
print("Data Types:\n", df.dtypes, "\n")

# 2. Missing values (count + %)
missing = df.isna().sum().sort_values(ascending=False)
missing_percent = (missing / len(df) * 100).round(2)
missing_table = pd.DataFrame({"MissingCount": missing, "MissingPercent": missing_percent})
print("Missing Values (Top 15):\n", missing_table.head(15), "\n")

# 3. Duplicate rows
dup_count = df.duplicated().sum()
print(f"Duplicate rows: {dup_count}")


Data Types:
 id                int64
url              object
region           object
region_url       object
price             int64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
VIN              object
drive            object
size             object
type             object
paint_color      object
image_url        object
description      object
county          float64
state            object
lat             float64
long            float64
posting_date     object
dtype: object 
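
Two entries in the listing above stand out: `year` is `float64`, most likely because pandas cannot hold missing values in a plain integer column, and `posting_date` is still `object`. A minimal sketch of tidying both, assuming `year` only holds whole numbers and `posting_date` holds ISO-style timestamps with a UTC offset. The helper name `tidy_dtypes` and the sample values are hypothetical.

```python
import pandas as pd

def tidy_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with year as nullable integer and posting_date as UTC datetime."""
    out = df.copy()
    # Nullable Int64 keeps missing years as <NA> instead of forcing float64
    out["year"] = out["year"].astype("Int64")
    # Unparseable strings become NaT rather than raising; offsets are normalised to UTC
    out["posting_date"] = pd.to_datetime(out["posting_date"], errors="coerce", utc=True)
    return out

# Made-up rows in the assumed format
demo = pd.DataFrame({
    "year": [2014.0, None, 2019.0],
    "posting_date": ["2021-05-04T12:31:18-0500", None, "2021-04-17T09:00:00-0400"],
})

tidy = tidy_dtypes(demo)
print(tidy.dtypes)
```

Converting to UTC is a design choice: it makes timestamps comparable across regions but discards the local offset of each listing.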

Missing Values (Top 15):
               MissingCount  MissingPercent
county              426880          100.00
size                306361           71.77
cylinders           177678           41.62
condition           174104           40.79
VIN                 161042           37.73
drive               130567           30.59
paint_color         