# Missing Value Analysis & Imputation for Cars Dataset

This notebook performs a full missing‑value workflow:
- Load the dataset
- Inspect missing values
- Visualize missingness (heatmap)
- Visualize distributions (histograms)
- Impute numeric + categorical missing values
- Compare before/after imputation
- Save the cleaned dataset


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
import missingno as msno

## Load the dataset
The file `cars.csv` is expected in the same folder as this notebook and uses a semicolon as the column separator.

In [None]:
cars = pd.read_csv("cars.csv", sep=";")
cars.head()

## Inspect missing values
`info()` shows the structure, data types, and non-null counts; `isna().sum()` gives the number of missing values per column.

In [None]:
cars.info()

In [None]:
cars.isna().sum()
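Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a summary; the helper name `missing_summary`, the toy values, and the `Origin` column are assumptions made for illustration and are not taken from the cars file.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing count and percentage per column, worst first."""
    counts = df.isna().sum()
    percent = (df.isna().mean() * 100).round(1)
    summary = pd.DataFrame({"n_missing": counts, "pct_missing": percent})
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for the cars data; these values are made up for illustration.
toy = pd.DataFrame({
    "MPG": [18.0, np.nan, 16.0, np.nan],
    "Horsepower": [130.0, 165.0, np.nan, 150.0],
    "Origin": ["US", "US", "Europe", "Japan"],  # hypothetical column
})

summary = missing_summary(toy)
print(summary)
```

On the real data, the same helper could be called as `missing_summary(cars)`.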

## Visualize missingness
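A heatmap of `cars.isna()` shows where missing values fall across rows and columns, which helps reveal patterns. Zeros that stand in for missing values in MPG and Horsepower do not appear here, because they are only converted to NaN in a later step.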

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(cars.isna(), cbar=False)
plt.title("Missing Value Heatmap")
plt.show()
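One pattern a heatmap can hint at is whether two columns tend to be missing in the same rows. A rough way to put a number on that is to correlate the missingness indicators. The sketch below uses made-up toy data, and the `Weight` column is a hypothetical name used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: MPG and Horsepower are missing in the same rows, Weight is not.
toy = pd.DataFrame({
    "MPG": [18.0, np.nan, 16.0, np.nan, 24.0],
    "Horsepower": [130.0, np.nan, 150.0, np.nan, 95.0],
    "Weight": [3504.0, 3693.0, np.nan, 3433.0, 2372.0],  # hypothetical column
})

# 1 where a value is missing, 0 otherwise.
indicators = toy.isna().astype(int)

# Columns that are never (or always) missing have no variance, so drop them.
indicators = indicators.loc[:, indicators.nunique() > 1]

# Values near 1 mean two columns tend to be missing together.
nullity_corr = indicators.corr()
print(nullity_corr)
```

A value close to 1 between two columns suggests they share a cause of missingness; values near 0 suggest the gaps are unrelated.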

## Histograms of numeric columns
Histograms show the distribution of each numeric column before imputation.

In [None]:
cars.hist(figsize=(14,10), bins=20)
plt.suptitle("Histograms Before Imputation")
plt.show()

## Treat zeros as missing (for MPG and Horsepower)
A car cannot have zero miles per gallon or zero horsepower, so zeros in these columns most likely encode missing values.

In [None]:
cols_zero_missing = ["MPG", "Horsepower"]
for col in cols_zero_missing:
    cars[col] = cars[col].replace(0, np.nan)

cars.isna().sum()
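It can be worth counting the zeros before replacing them, so the number of values reclassified as missing is on record. A minimal sketch on made-up toy data follows; the `Cylinders` column is a hypothetical name, included only to show that columns outside the list are left untouched.

```python
import numpy as np
import pandas as pd

# Toy data with zeros standing in for missing values.
toy = pd.DataFrame({
    "MPG": [18.0, 0.0, 16.0, 0.0],
    "Horsepower": [130.0, 165.0, 0.0, 150.0],
    "Cylinders": [8, 8, 8, 8],  # hypothetical column, not in the list below
})

cols_zero_missing = ["MPG", "Horsepower"]

# Count zeros per column before touching the data.
zeros_before = (toy[cols_zero_missing] == 0).sum()

# Replace zeros with NaN in the selected columns only.
toy[cols_zero_missing] = toy[cols_zero_missing].replace(0, np.nan)

# With no NaN present beforehand, these counts match zeros_before.
nans_after = toy[cols_zero_missing].isna().sum()
print(pd.DataFrame({"zeros_before": zeros_before, "nans_after": nans_after}))
```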

## Impute missing values
- Numeric → median
- Categorical → most frequent

In [None]:
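# Split columns by dtype so each group gets a suitable strategy.
numeric_cols = cars.select_dtypes(include=[np.number]).columns
categorical_cols = cars.select_dtypes(exclude=[np.number]).columns

cars_imputed = cars.copy()

# Numeric columns: fill with the column median (robust to outliers).
# Integer columns may come back as floats after imputation.
if len(numeric_cols) > 0:
    num_imputer = SimpleImputer(strategy="median")
    cars_imputed[numeric_cols] = num_imputer.fit_transform(cars_imputed[numeric_cols])

# Categorical columns: fill with the most frequent value.
# The guard avoids an error when the dataset has no such columns.
if len(categorical_cols) > 0:
    cat_imputer = SimpleImputer(strategy="most_frequent")
    cars_imputed[categorical_cols] = cat_imputer.fit_transform(cars_imputed[categorical_cols])

cars_imputed.isna().sum()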
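To make the two strategies concrete, here is a small worked example on made-up data. It uses plain pandas `fillna` rather than `SimpleImputer`, so it illustrates the same idea but is not the notebook's own method; the `Origin` column and all values are assumptions. It also keeps a mask of which cells were filled, which can be useful for later checks.

```python
import numpy as np
import pandas as pd

# Toy data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "MPG": [18.0, np.nan, 16.0, 30.0],
    "Origin": ["US", None, "US", "Japan"],  # hypothetical column
})

# Remember which cells were missing before filling them.
was_missing = toy.isna()

filled = toy.copy()

# Numeric: median of the observed values 16, 18, 30 is 18.
filled["MPG"] = filled["MPG"].fillna(filled["MPG"].median())

# Categorical: "US" is the most frequent observed value.
filled["Origin"] = filled["Origin"].fillna(filled["Origin"].mode().iloc[0])

print(filled)
print(was_missing)
```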

## Compare distributions before and after imputation

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14,5))
sns.histplot(cars["MPG"], kde=True, ax=axes[0])
axes[0].set_title("MPG Before Imputation")

sns.histplot(cars_imputed["MPG"], kde=True, ax=axes[1])
axes[1].set_title("MPG After Imputation")

plt.show()
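Summary statistics can back up the visual comparison. Filling with the median leaves the median itself unchanged but typically shrinks the spread, because the filled values all sit at one point. The sketch below shows this on a made-up series and uses pandas `fillna` as a stand-in for the imputer.

```python
import numpy as np
import pandas as pd

# Toy MPG-like series with two gaps; values are made up for illustration.
before = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan, 40.0], name="MPG")

# Median imputation (pandas stand-in for SimpleImputer(strategy="median")).
after = before.fillna(before.median())

# describe() ignores NaN, so "before" summarizes only the observed values.
comparison = pd.DataFrame({
    "before": before.describe(),
    "after": after.describe(),
})
print(comparison)
```

On the real data, the same table could be built from `cars["MPG"].describe()` and `cars_imputed["MPG"].describe()`.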

## Save cleaned dataset

In [None]:
cars_imputed.to_csv("cars_cleaned.csv", index=False)
cars_imputed.head()
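A quick round trip can confirm that the saved file reads back as expected. Note that `to_csv` writes comma-separated values by default, whereas the original file was read with `sep=";"`, so the cleaned file should be read with the default separator. The sketch below uses an in-memory buffer and made-up values instead of a file on disk.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned data; values are made up for illustration.
cleaned = pd.DataFrame({
    "MPG": [18.0, 17.0, 16.0],
    "Horsepower": [130.0, 165.0, 150.0],
})

# Write to an in-memory buffer; to_csv uses a comma separator by default.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Read back with the default separator, which matches the writer.
reloaded = pd.read_csv(buffer)

print(reloaded.isna().sum())
print(reloaded.equals(cleaned))
```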