# Data Mining Project: Used Car Sales Analysis

## 1. Domain & Introduction

**Domain:** Automotive Sales / E-commerce

**Objective:** The primary goal of this project is to apply data mining techniques to a dataset of used car sales. By analyzing features such as price, mileage, manufacturing year, and fuel type, we aim to uncover hidden patterns, segment the market into distinct clusters, identify pricing anomalies, and discover association rules that can inform sales strategies.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from mlxtend.frequent_patterns import apriori, association_rules
import warnings

warnings.filterwarnings('ignore')
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## 2. Dataset Description

- **Source:** The dataset `used_car_sales.csv`.
- **Key Features:** Price, Year, Mileage, Fuel, Gearbox, Car Type, Engine Power.
- **Goal:** To prepare this data for analysis by cleaning, filtering, and transforming it.

In [None]:
# Load data
df = pd.read_csv("used_car_sales.csv")

# 1. Column Standardization: strip stray whitespace from header names
df.columns = df.columns.str.strip()
print("Original Columns:", df.columns.tolist())

# 2. Feature Selection
selected_columns = [
    "Price-$", "Manufactured Year", "Mileage-KM", "Energy",
    "Gearbox", "Car Type", "Engine Power-HP"
]
# .copy() gives an independent frame, so the in-place operations below
# do not act on a view of the original data
df = df[selected_columns].copy()

# Rename for easier access
df.rename(columns={
    "Price-$": "price",
    "Manufactured Year": "year",
    "Mileage-KM": "mileage",
    "Energy": "fuel",
    "Gearbox": "gearbox",
    "Car Type": "car_type",
    "Engine Power-HP": "engine_hp"
}, inplace=True)

# 3. Data Filtering & Cleaning
# Filter logic: Price > 0, Mileage > 0, Year > 1990
initial_shape = df.shape
df = df[df["price"] > 0]
df = df[df["mileage"] > 0]
df = df[df["year"] > 1990]

# 4. Drop Duplicates and NAs
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

print(f"Shape after cleaning: {df.shape} (Removed {initial_shape[0] - df.shape[0]} rows)")
display(df.head())
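
To make the effect of each cleaning rule visible, the same filters can be tried on a tiny hand-made frame. This is only a sketch: the rows are invented for illustration and are not taken from `used_car_sales.csv`, and the names `toy`, `mask`, and `cleaned` are hypothetical.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
toy = pd.DataFrame({
    "price":   [12000, 0, 8500, 8500, 4000, 15000],
    "mileage": [60000, 30000, 120000, 120000, 90000, None],
    "year":    [2015, 2018, 2009, 2009, 1985, 2020],
})

# Same rules as the cleaning cell, combined into one boolean mask.
# Comparisons against a missing value evaluate to False, so the row with
# missing mileage is already excluded by the mileage filter.
mask = (toy["price"] > 0) & (toy["mileage"] > 0) & (toy["year"] > 1990)

# Exact duplicates and any remaining missing values are removed afterwards.
cleaned = toy[mask].drop_duplicates().dropna()

print(f"{len(toy)} rows in, {len(cleaned)} rows out")
print(cleaned)
```

Combining the conditions into a single mask is equivalent to applying the three filters one after another; chaining them, as the notebook does, makes it easier to count how many rows each rule removes.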

## 3. Method: Outlier Detection

**Technique:** Isolation Forest

**Goal:** Identify anomalies in the data (e.g., cars with unusual price-to-mileage ratios).

In [None]:
# Isolation Forest for Anomaly Detection
# We use price, mileage, and year as the indicators for anomalies
iso = IsolationForest(contamination=0.05, random_state=42)
df['outlier_status'] = iso.fit_predict(df[['price', 'mileage', 'year']])

# -1 indicates outlier, 1 indicates normal
outliers = df[df['outlier_status'] == -1]
normal = df[df['outlier_status'] == 1]
print(f"Detected {len(outliers)} outliers.")

# Visualization: Outliers
# Map the numeric status to text so the legend entries cannot be mislabeled
# (passing labels=[...] to plt.legend relies on the handle order)
status_label = df['outlier_status'].map({1: 'Normal', -1: 'Outlier'})
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='mileage', y='price', hue=status_label,
                palette={'Normal': 'blue', 'Outlier': 'red'})
plt.title('Anomaly Detection: Price vs Mileage')
plt.xlabel('Mileage (KM)')
plt.ylabel('Price ($)')
plt.legend(title='Status')
plt.show()

# Remove outliers for further analysis
df_clean = df[df['outlier_status'] == 1].drop('outlier_status', axis=1)
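
The role of `contamination` is easier to see on data where the anomalies are known in advance. The sketch below uses synthetic points, not the car data: 195 points drawn from a standard normal distribution plus 5 planted far-away points. The names `iso_demo`, `planted`, and `n_flagged` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data: 195 ordinary points plus 5 planted far-away points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(195, 2))
planted = np.array([[6, 6], [-6, 6], [6, -6], [-6, -6], [0, 8]], dtype=float)
X = np.vstack([inliers, planted])

# contamination=0.05 asks the model to flag roughly 5% of the rows.
iso_demo = IsolationForest(contamination=0.05, random_state=42)
labels = iso_demo.fit_predict(X)        # -1 = outlier, 1 = normal
scores = iso_demo.decision_function(X)  # negative = flagged, lower = more anomalous

n_flagged = int((labels == -1).sum())
print(f"Flagged {n_flagged} of {len(X)} points")
print("All planted points flagged:", bool((labels[-5:] == -1).all()))
```

Because `contamination` fixes the share of rows that get flagged, the model reports about 5% outliers even when the data contains fewer genuine anomalies. Inspecting `decision_function` scores helps separate clear anomalies from borderline rows.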

## 4. Method: Clustering

**Technique:** K-Means Clustering

**Goal:** Segment the cars into distinct market groups based on their features.

In [None]:
# Feature Preparation for Clustering
# One-Hot Encoding for categorical features (dtype=float keeps the matrix fully numeric)
cluster_data = pd.get_dummies(df_clean, columns=['fuel', 'gearbox', 'car_type'],
                              drop_first=True, dtype=float)

# Standardize all columns (numeric and one-hot) so no single feature dominates the distances
scaler = StandardScaler()
scaled_data = scaler.fit_transform(cluster_data)

# K-Means with k=5
kmeans = KMeans(n_clusters=5, random_state=42)
df_clean['cluster'] = kmeans.fit_predict(scaled_data)

# Analyze clusters: per-cluster means of the original (unscaled) numeric features
cluster_summary = df_clean.groupby('cluster')[['price', 'year', 'mileage', 'engine_hp']].mean()
print("Cluster Summary:")
display(cluster_summary)

# Visualization: Clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_clean, x='mileage', y='price', hue='cluster', palette='viridis', s=100)
plt.title('Car Market Segments (K-Means Clustering)')
plt.xlabel('Mileage (KM)')
plt.ylabel('Price ($)')
plt.show()
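
The notebook fixes `k=5` without showing how that value was chosen. One common way to sanity-check the number of clusters is the silhouette score. The sketch below is not part of the original analysis: it uses synthetic blobs rather than the car data, and the names `sil` and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data: three well-separated blobs of 50 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for several candidate k and record the mean silhouette score
# (ranges from -1 to 1; higher means tighter, better-separated clusters).
sil = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    sil[k] = silhouette_score(X, km.fit_predict(X))

best_k = max(sil, key=sil.get)
for k, s in sil.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k on this toy data:", best_k)
```

Running the same loop over `scaled_data` would give a rough indication of whether five segments are supported by the car data, although real data rarely shows a peak as sharp as synthetic blobs do.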

## 5. Method: Association Rule Mining

**Technique:** Apriori Algorithm

**Goal:** Discover interesting relationships between car attributes.

In [None]:
# Discretization for Association Rules
# We need categorical data for Apriori. We bin continuous variables.
ar_data = df_clean.copy()
ar_data['price_bin'] = pd.qcut(ar_data['price'], q=3, labels=['Low Price', 'Mid Price', 'High Price'])
ar_data['mileage_bin'] = pd.qcut(ar_data['mileage'], q=3, labels=['Low Mileage', 'Mid Mileage', 'High Mileage'])
ar_data['year_bin'] = pd.cut(ar_data['year'], bins=[1990, 2010, 2018, 2025], labels=['Old', 'Modern', 'New'])

# Select features for rules
basket = ar_data[['price_bin', 'mileage_bin', 'year_bin', 'fuel', 'gearbox', 'car_type']]
basket_encoded = pd.get_dummies(basket)

# Apriori
frequent_itemsets = apriori(basket_encoded, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)

# Filter for interesting rules (features implying price)
price_rules = rules[rules['consequents'].astype(str).str.contains('Price')]
price_rules = price_rules.sort_values('lift', ascending=False).head(10)

print("Top 5 Association Rules pointing to Price Segments:")
display(price_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())

# Visualization: Association Rules Scatter
plt.figure(figsize=(10, 6))
sns.scatterplot(data=price_rules, x='support', y='confidence', size='lift', hue='lift', sizes=(100, 400))
plt.title('Association Rules: Support vs Confidence (Size=Lift)')
plt.show()
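
The `support`, `confidence`, and `lift` columns reported above follow the standard definitions, which can be checked by hand on a small example. The listings below are hypothetical and chosen only to keep the arithmetic simple; the helper `support` is likewise an illustrative name, not part of `mlxtend`.

```python
# Hypothetical listings; each one is a set of attribute "items".
listings = [
    {"Automatic", "SUV", "High Price"},
    {"Automatic", "SUV", "High Price"},
    {"Automatic", "Sedan", "Mid Price"},
    {"Manual", "SUV", "High Price"},
    {"Manual", "Hatchback", "Low Price"},
    {"Manual", "Hatchback", "Low Price"},
    {"Automatic", "Hatchback", "Mid Price"},
    {"Manual", "Sedan", "Low Price"},
]

def support(itemset):
    """Fraction of listings that contain every item in `itemset`."""
    return sum(itemset <= row for row in listings) / len(listings)

# Rule under test: {SUV} -> {High Price}
antecedent = {"SUV"}
consequent = {"High Price"}

supp = support(antecedent | consequent)  # P(A and C)
conf = supp / support(antecedent)        # P(C | A)
lift = conf / support(consequent)        # P(C | A) / P(C)

print(f"support={supp:.3f}, confidence={conf:.3f}, lift={lift:.3f}")
```

A lift above 1 means the antecedent makes the consequent more likely than its baseline frequency, which is why the notebook keeps only rules with a lift of at least 1.2.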

## 6. Conclusion

In this analysis, we successfully:

1. **Cleaned** the data to ensure quality.
2. **Identified anomalies** using Isolation Forest, highlighting potentially overpriced or suspicious listings.
3. **Segmented the market** into 5 distinct clusters using K-Means, which can help in targeted marketing strategies.
4. **Discovered rules** using Apriori that link car features (like 'Automatic' or 'SUV') to Price categories.

These insights provide a data-driven foundation for understanding the used car market.