# CS5228-2410 Final Project EDA

## Overview

*(from [Kaggle Project](https://www.kaggle.com/competitions/cs-5228-2410-final-project))*

In this project, we look into the market for used cars in Singapore. Car ownership in Singapore is rather expensive which includes the very high prices for new and used cars (compared to many other countries). There are many stakeholders in this market. Buyers and sellers want to find good prices, so they need to understand what affects the value of a car. Online platforms facilitating the sale of used cars, on the other hand, want to maximize the number of sales/transactions.

The goal of this task is to predict the resale price of a car based on its properties (e.g., make, model, mileage, age, power, etc). It is therefore first and foremost a regression task. These different types of information allow you to come up with features for training a regressor. It is part of the project for you to justify, derive and evaluate different features. Besides predicting the outcome in terms of a dollar value, other useful results include the importance of different attributes, the evaluation and comparison of different regression techniques, an error analysis and discussion about limitations and potential extensions, etc.

## Main Steps
1. Load dataset
2. Initial inspection
    - Basic information
    - Summary stats
3. Data quality check
    - Missing values
    - Identify duplicates
    - Examine data types
4. EDA
    - Univariate analysis
        - Numerical: histograms, boxplots
        - Categorical: barplots, pie charts
    - Bivariate analysis
        - Scatterplots: price v numerical
        - Boxplots: price v categorical
    - Correlation analysis

In [240]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from wordcloud import WordCloud
import difflib
import dateutil.relativedelta  # the submodule is not guaranteed to load with a bare `import dateutil`
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

In [241]:
pd.set_option('display.max_columns', None)

In [242]:
os.makedirs('visualisations', exist_ok=True)

## 1    Load data

In [243]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

## 2    Initial Inspection

In [None]:
print("# of records: ",len(df_train))
print("# of columns: ",len(df_train.columns))

In [None]:
print("# of records: ",len(df_test))
print("# of columns: ",len(df_test.columns))

In [None]:
df_train.info()

In [None]:
print(df_train.head(3))

## 3    Data Quality Check

In [None]:
# Calculate missing values
def get_missing_info(df_train):
    missing_values = df_train.isnull().sum()
    missing_percentages = (missing_values / len(df_train) * 100).round(2)

    missing_info = pd.DataFrame({
        'missing_count': missing_values,
        'missing_perc': missing_percentages
    }).sort_values(by="missing_count", ascending=False)

    return missing_info

missing_info = get_missing_info(df_train=df_train)
print("\n=== Missing Values Analysis ===")
# get_missing_info already sorts by missing_count, so no second sort is needed
print(missing_info[missing_info['missing_count'] > 0])


The dataset has significant missing data across several features:

- 'indicative_price' is completely missing (100%)
- 'opc_scheme' and 'original_reg_date' are missing for nearly all records (>98%)
- 'lifespan' is missing for about 90% of the records
- 'fuel_type' is missing for about 76% of the records
- 'mileage' is missing for about 21% of the records
- Several other fields have missing data ranging from 0.03% to 15.25%


In [None]:
print("\n=== Duplicate Records Analysis ===")
print(f"Number of duplicate rows: {df_train.duplicated().sum()}")
print(f"Number of duplicate listing_ids: {df_train['listing_id'].duplicated().sum()}")


There are no duplicate rows or listing_ids in the dataset.

In [None]:
df_train.sample(3)

**Observations**
- Since `indicative_price` is completely missing, drop the column.
- `listing_id` is an identifier and is not meaningful to the analysis either.
- Since `original_reg_date` is almost entirely missing, the feature is not meaningful on its own. Based on context, the latest `reg_date` may be more useful, as it is likely closely tied to the COE price at the time of registration, and COE has a heavy influence on car resale price.
- `fuel_type` has many missing values but can potentially be obtained from `category`.
- `lifespan` has many missing values but can potentially be inferred from the `title`.
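
The drop decision above can be made repeatable with a small helper. This is only a sketch: the function name `columns_above_missing_threshold`, the 0.9 cut-off, and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold=0.9):
    """Return the columns whose share of missing values is at least `threshold`."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    return missing_share[missing_share >= threshold].index.tolist()

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "indicative_price": [np.nan, np.nan, np.nan, np.nan],
    "mileage": [50000.0, np.nan, 72000.0, 12000.0],
    "price": [80000, 65000, 120000, 99000],
})

to_drop = columns_above_missing_threshold(toy, threshold=0.9)
print(to_drop)
```

A threshold only shortlists candidates. Columns such as `opc_scheme` are almost entirely missing yet still informative, because the absence of a value is itself a signal.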

In [None]:
COLS_TO_DROP = ['indicative_price']

df_train = df_train.drop(columns=COLS_TO_DROP)
df_train.columns

#### Basic Sanity Check

In [None]:
# Check model make quality
unique_makes = df_train['make'].unique()
unique_models = df_train['model'].unique()

make_model_pairs = df_train.groupby('make')['model'].apply(set)
model_count = df_train['model'].value_counts()
make_count = df_train['make'].value_counts()

# Check for missing values
missing_makes = df_train['make'].isnull().sum()
missing_models = df_train['model'].isnull().sum()

# Normalise text (strip whitespace, title-case); the counts above reflect the raw values
df_train['make'] = df_train['make'].str.strip().str.title()
df_train['model'] = df_train['model'].str.strip().str.title()

# Output the results
print(f"Unique Makes: {unique_makes}")
print(f"Unique Models: {unique_models}")
print("Make-Model Pairs:")
print(make_model_pairs)
print(f"Missing Makes: {missing_makes}, Missing Models: {missing_models}")
print(model_count[:10])
print(make_count[:10])

In [None]:
# Check manufactured year
# convert year to type int
df_train['manufactured'] = pd.to_numeric(df_train['manufactured'], errors='coerce').astype('Int64')
print(df_train['manufactured'].describe())

# Check for future years
current_year = pd.Timestamp.now().year
if (df_train['manufactured'] > current_year).any():
    print("There are future years of manufacture")
else:
    print("There are no future years of manufacture")

In [None]:
# Convert date columns to datetime format (assuming they are in string format)
df_train['original_reg_date'] = pd.to_datetime(df_train['original_reg_date'], errors='coerce')
df_train['reg_date'] = pd.to_datetime(df_train['reg_date'], errors='coerce')

# Check for Missing Values
missing_original_reg_date = df_train['original_reg_date'].isnull().sum()
missing_reg_date = df_train['reg_date'].isnull().sum()

# Check Data Types
original_reg_date_type = df_train['original_reg_date'].dtype
reg_date_type = df_train['reg_date'].dtype

# Check Validity of Dates
future_dates_original_reg = df_train[df_train['original_reg_date'] > pd.Timestamp.now()]
future_dates_reg = df_train[df_train['reg_date'] > pd.Timestamp.now()]

# Check Logical Consistency
inconsistent_dates = df_train[df_train['reg_date'] < df_train['original_reg_date']]

# Output results
print(f"Missing values in 'original_reg_date': {missing_original_reg_date}")
print(f"Missing values in 'reg_date': {missing_reg_date}")
print(f"Data type of 'original_reg_date': {original_reg_date_type}")
print(f"Data type of 'reg_date': {reg_date_type}")

if not future_dates_original_reg.empty:
    print(f"Future dates found in 'original_reg_date': {future_dates_original_reg.shape[0]} entries")
else:
    print("No future dates found in 'original_reg_date'.")

if not future_dates_reg.empty:
    print(f"Future dates found in 'reg_date': {future_dates_reg.shape[0]} entries")
else:
    print("No future dates found in 'reg_date'.")

if not inconsistent_dates.empty:
    print(f"Inconsistent dates found: {inconsistent_dates.shape[0]} entries where 'reg_date' is earlier than 'original_reg_date'.")
else:
    print("All dates are consistent: 'reg_date' is not earlier than 'original_reg_date'.")


In [None]:
# Check for Missing Values
missing_vehicle_type = df_train['type_of_vehicle'].isnull().sum()
missing_transmission = df_train['transmission'].isnull().sum()

# Check Unique Values
unique_vehicle_types = df_train['type_of_vehicle'].unique()
unique_transmissions = df_train['transmission'].unique()

# Check for Consistency
# Define expected categories
expected_vehicle_types = ['suv', 'luxury sedan', 'mpv', 'mid-sized sedan', 'sports car', 'truck', 'hatchback', 'stationwagon', 'bus/mini bus', 'van']

# Check if unique values are in expected categories
inconsistent_vehicle_types = [v for v in unique_vehicle_types if v not in expected_vehicle_types]

# Output results
print(f"Missing values in 'type_of_vehicle': {missing_vehicle_type}")
print(f"Missing values in 'transmission': {missing_transmission}")

print(f"Unique values in 'type_of_vehicle': {unique_vehicle_types}")
print(f"Unique values in 'transmission': {unique_transmissions}")

if inconsistent_vehicle_types:
    print(f"Inconsistent vehicle types found: {inconsistent_vehicle_types}")
else:
    print("All vehicle types are consistent with expected categories.")

# Step 1: Count the distinct vehicle types recorded for each make-model pair
unique_vehicle_count = (df_train
                        .groupby(['make', 'model'])['type_of_vehicle']
                        .nunique()
                        .reset_index(name='unique_count'))

# Step 2: Filter to find combinations with more than one unique vehicle type
multiple_types = unique_vehicle_count[unique_vehicle_count['unique_count'] > 1]

# Step 3: Check if there are any such combinations
if not multiple_types.empty:
    print("The following model and make combinations map to more than one vehicle type:")
    print(multiple_types)
else:
    print("All model and make combinations map to a single vehicle type.")

###    Textual Features Extraction

In [None]:
categorical_columns = df_train.select_dtypes(include=["object"]).columns
categorical_columns

In [257]:
text_features = ['title', 'description', 'category', 'features', 'accessories', 'opc_scheme', 'eco_category']

In [None]:
# Create a word cloud for each text feature
for feature in text_features:
    text_data = ' '.join(df_train[feature].dropna().tolist())
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_data)

    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')  
    plt.title(f'Wordcloud – {feature}')
    plt.savefig(f'visualisations/wordcloud_{feature}.png')
    plt.show()

### Data Cleaning & Data Transformation

In [259]:
## Create binary variables from text features

# Convert opc scheme to binary for further analysis (1 = an OPC scheme is recorded)
df_train['opc_scheme'] = df_train['opc_scheme'].notna().astype(int)
# Create column for parf v coe cars (na=False treats a missing category as "not parf")
df_train['parf'] = df_train['category'].str.contains('parf', regex=False, na=False).astype(int)
# Create column for rare & exotic cars
df_train['rare'] = df_train['category'].str.contains('rare & exotic', regex=False, na=False).astype(int)
# Create column for vintage cars
df_train['vintage'] = df_train['category'].str.contains('vintage', regex=False, na=False).astype(int)

In [None]:
binary_columns = ['opc_scheme', 'parf', 'rare', 'vintage']
for col in binary_columns:
    print(f"\n{col}:")
    print(df_train[col].value_counts().head())

In [None]:
print("\n=== Fix Missing Values (make) ===")
print("Missing values: ",df_train['make'].isna().sum())


df_train['make']          = df_train['make'].str.upper()
df_train['make']          = df_train['make'].str.replace(' ','').str.strip()
df_train['title']         = df_train['title'].str.upper()

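make_list                   = sorted(make for make in df_train['make'].unique().tolist() if isinstance(make, str))
print("Unique values: ")
print(make_list[:20])
print()

def closest_known_make(token):
    """Return the known make closest to `token`, or `token` itself if nothing is similar enough."""
    if not isinstance(token, str):
        return token
    matches = difflib.get_close_matches(token, make_list, n=1)
    return matches[0] if matches else token

# The first word of the title is usually the make; snap it to the closest known make
df_train['make_temp']     = df_train['title'].str.split(' ').str[0]
df_train['make_temp_similar']     = df_train['make_temp'].apply(closest_known_make)

df_train['make']          = df_train['make'].fillna(df_train['make_temp_similar'])
df_train                  = df_train.drop(columns = ['make_temp', 'make_temp_similar'])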
print("Missing values (after cleaning): ", df_train['make'].isna().sum())

In [None]:
print("\n=== Fix Missing Values (manufactured) ===")
print("Missing values: ",df_train['manufactured'].isna().sum())

df_train['original_reg_date']         = pd.to_datetime(df_train['original_reg_date'], format = "%d-%b-%Y")
df_train['reg_date']                  = pd.to_datetime(df_train['reg_date'], format = "%d-%b-%Y")

df_train['original_reg_date_temp']    = df_train['original_reg_date'].dt.year
df_train['reg_date_temp']             = df_train['reg_date'].dt.year
df_train['manufactured']              = df_train['manufactured'].fillna(df_train[['original_reg_date_temp','reg_date_temp']].min(axis=1))
# Keep the year numeric so it is included in the numerical summaries and correlation matrix
df_train['manufactured']              = df_train['manufactured'].astype('int64')
df_train                              = df_train.drop(columns = ['original_reg_date_temp', 'reg_date_temp'])
print("Missing values (after cleaning): " ,df_train['manufactured'].isna().sum())


In [None]:
print("\n=== Fix Missing Values (no_of_owners) ===")
print("Missing values: ",df_train['no_of_owners'].isna().sum())
print("\nSummary statistics: ", df_train['no_of_owners'].describe())
df_train['no_of_owners'] = df_train['no_of_owners'].fillna(df_train['no_of_owners'].median())
print("\nMissing values (after cleaning): " ,df_train['no_of_owners'].isna().sum())

In [None]:
# Extract fuel type from category
fuel_keywords = {
    'electric': 'electric',
    'hybrid': 'petrol-electric'
}

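def extract_fuel_type(category_text):
    # A missing or non-string category carries no fuel information
    if not isinstance(category_text, str):
        return None
    category_text = category_text.lower()
    for keyword, fuel_type in fuel_keywords.items():
        if keyword in category_text:
            return fuel_type
    return None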

# Apply the function to the rows where fuel_type is missing
df_train['fuel_type_category_fill'] = df_train['fuel_type']
print(f"Number of missing values for fuel_type (before): {df_train['fuel_type_category_fill'].isna().sum()}")
df_train.loc[df_train['fuel_type'].isna(), 'fuel_type_category_fill'] = df_train['category'].apply(extract_fuel_type)
print(f"Number of missing values for fuel_type (after): {df_train['fuel_type_category_fill'].isna().sum()}")


In [None]:
# Create mapping for fuel type from model make
fuel_type_mapping = df_train.groupby(['make', 'model'])['fuel_type'].agg(lambda x: x.mode()[0] if not x.mode().empty else None).reset_index()
fuel_type_dict = dict(zip(zip(fuel_type_mapping['make'], fuel_type_mapping['model']), fuel_type_mapping['fuel_type']))

# Define a function to fill missing fuel types
def fill_fuel_type(row, fuel_type_dict):
    if pd.isna(row['fuel_type']):
        return fuel_type_dict.get((row['make'], row['model']), None)
    return row['fuel_type']

# Apply the function to fill in missing values
df_train['fuel_type_model_make_fill'] = df_train['fuel_type']
print(f"Number of missing values for fuel_type (before): {df_train['fuel_type_model_make_fill'].isna().sum()}")
df_train.loc[df_train['fuel_type'].isna(), 'fuel_type_model_make_fill']  = df_train.apply(fill_fuel_type, axis=1, fuel_type_dict=fuel_type_dict)
print(f"Number of missing values for fuel_type (after): {df_train['fuel_type_model_make_fill'].isna().sum()}")

df_train['fuel_type'] = df_train['fuel_type_model_make_fill']


In [266]:
df_train['AGE-current'] = datetime.now().year - df_train['manufactured'].astype(int)

"""Calculate the remaining years of COE based on the vehicle title and registration dates."""
df_train['title'] = df_train['title'].str.upper()  # Convert titles to uppercase for consistency

# Extract the COE expiry (MM/YYYY) from the last token of the title if it contains 'COE'
df_train.loc[df_train['title'].str.contains('COE', na=False), 'coe_temp'] = df_train['title'].str.split(' ').str[-1]
df_train['coe_temp1'] = df_train['coe_temp'].str.replace(')', '', regex=False)  # Strip the literal closing parenthesis

# Convert the cleaned COE year to datetime
df_train['coe_temp1'] = pd.to_datetime(df_train['coe_temp1'], format="%m/%Y", errors='coerce')

# Calculate the COE end date based on the latest registration date
df_train['coe_temp2'] = df_train[['original_reg_date', 'reg_date']].max(axis=1) + pd.offsets.DateOffset(years=10)
df_train['coe_enddate'] = df_train['coe_temp1'].fillna(df_train['coe_temp2'])  # Use COE date if available

# Calculate remaining whole years until COE expiry (NaN when no end date could be derived)
now = datetime.now()
df_train['AGE-remaining'] = df_train['coe_enddate'].apply(
    lambda end: dateutil.relativedelta.relativedelta(end, now).years if pd.notna(end) else np.nan
)

# Clean up temporary columns used for calculations
df_train = df_train.drop(columns=["coe_temp", "coe_temp1", "coe_temp2", "coe_enddate"])


In [267]:
# Extract coe end date from title
df_train['coe_end'] = df_train['title'].str.extract(r'\(COE TILL (\d{2}/\d{4})\)')
df_train['coe_end'] = pd.to_datetime(df_train['coe_end'], format='%m/%Y')

In [None]:
missing_info = get_missing_info(df_train=df_train)
missing_info

## 4    EDA

### Univariate Analysis

#### Numerical Analysis

In [None]:
# Identify numerical columns
numerical_cols = df_train.select_dtypes(include=['int64', 'float64']).columns.drop(['listing_id'])

print("\n=== Numerical Features Analysis ===")
print(df_train[numerical_cols].describe())

# Check for outliers using IQR method
print("\n=== Outlier Analysis (IQR Method) ===")
for col in numerical_cols:
    Q1 = df_train[col].quantile(0.25)
    Q3 = df_train[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df_train[(df_train[col] < (Q1 - 1.5 * IQR)) | (df_train[col] > (Q3 + 1.5 * IQR))][col]
    if len(outliers) > 0:
        print(f"\n{col}:")
        print(f"Number of outliers: {len(outliers)}")
        print(f"Percentage of outliers: {(len(outliers)/len(df_train)*100):.2f}%")

fig, axes = plt.subplots(len(numerical_cols), 2, figsize=(10, 2*len(numerical_cols)))

for i, feature in enumerate(numerical_cols):
    
    # Histogram
    sns.histplot(df_train[feature].dropna(), kde=True, ax=axes[i, 0])
    axes[i, 0].set_title(f'Histogram of {feature}')
    axes[i, 0].set_xlabel(feature)
    
    # Box plot
    sns.boxplot(x=df_train[feature].dropna(), ax=axes[i, 1])
    axes[i, 1].set_title(f'Box Plot of {feature}')
    axes[i, 1].set_xlabel(feature)

plt.tight_layout()
plt.savefig("visualisations/distribution_numerical.png", bbox_inches='tight')
plt.show()


In [None]:
"""Plot correlation matrix of numerical features"""
corr = df_train[numerical_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Matrix of Numerical Features')
plt.savefig("visualisations/correlation_matrix.png")
plt.show()

In [None]:
print(corr)

**Strongest correlations with price**
- dereg_value (0.91)
- arf (0.89)
- omv (0.82)
- depreciation (0.81)
- power (0.70)

Focus on these features for initial modelling effort.


**Moderate correlations with price**
- road_tax (0.52)
- engine_cap (0.44)
- coe (0.35)
- mileage (-0.39)
- rare (0.60)

**Low correlation with price**
- manufactured (0.20)
- curb_weight (0.15)
- no_of_owners (-0.08)

**Strong correlations between features**

- omv and arf (0.94)
- engine_cap and road_tax (0.94)
- power and engine_cap (0.86)

These pairs point to potential multicollinearity, which may affect some models (tree-based models are less sensitive to it). Options are to keep only the most relevant feature from each pair, or to apply a dimensionality reduction method such as PCA.
Feature engineering could also create interaction terms between strongly related features (e.g. power * engine_cap).
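
Reading strongly correlated pairs off a heatmap by eye is error-prone, so the check can be sketched as a small helper. The function name `high_corr_pairs`, the 0.85 threshold, and the synthetic columns below are assumptions for illustration; in the notebook the input would be `df_train[numerical_cols]`.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.85):
    """List feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if value >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Synthetic data: 'arf' is built from 'omv', 'mileage' is independent noise
rng = np.random.default_rng(0)
omv = rng.normal(size=200)
toy = pd.DataFrame({
    "omv": omv,
    "arf": 1.1 * omv + rng.normal(scale=0.1, size=200),
    "mileage": rng.normal(size=200),
})

pairs = high_corr_pairs(toy, threshold=0.85)
print(pairs)
```

For each reported pair, one feature can be dropped, or both can be kept for tree-based models, which tolerate correlated inputs better.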






In [272]:
strong_corr_price = ['dereg_value', 'arf', 'omv', 'depreciation', 'power']
moderate_corr_price = ['road_tax', 'engine_cap', 'coe', 'mileage', 'rare', 'AGE-remaining']

In [None]:
# pairplot creates its own figure, so the title and file are set on the returned grid
g = sns.pairplot(df_train, vars=strong_corr_price + ['price'], hue='parf')
g.fig.suptitle('Features with strong corr with Price Pairplot (by parf)', y=1.02)
g.savefig("visualisations/pairplot_strong_corr_by_parf.png")
plt.show()

In [None]:
# pairplot creates its own figure, so the title and file are set on the returned grid
g = sns.pairplot(df_train, vars=strong_corr_price+['price'], hue='rare')
g.fig.suptitle('Features with strong corr with Price Pairplot (by rare)', y=1.02)
g.savefig("visualisations/pairplot_strong_corr_by_rare.png")
plt.show()

In [None]:
# pairplot creates its own figure, so the title and file are set on the returned grid
g = sns.pairplot(df_train, vars=moderate_corr_price + ['price'], hue='parf')
g.fig.suptitle('Features with moderate corr with Price Pairplot (by parf)', y=1.02)
g.savefig("visualisations/pairplot_mod_corr_by_parf.png")
plt.show()

In [None]:
# pairplot creates its own figure, so the title and file are set on the returned grid
g = sns.pairplot(df_train, vars=moderate_corr_price+['price'], hue='rare')
g.fig.suptitle('Features with moderate corr with Price Pairplot (by rare)', y=1.02)
g.savefig("visualisations/pairplot_mod_corr_by_rare.png")
plt.show()

#### Categorical Analysis

In [None]:
# Identify categorical columns
categorical_cols = df_train.select_dtypes(include=['object']).columns

print("\n=== Categorical Features Analysis ===")
for col in categorical_cols:
    unique_values = df_train[col].nunique()
    print(f"\n{col}:")
    print(f"Number of unique values: {unique_values}")
    if unique_values < 10:  # Only show value counts for columns with few unique values
        print(df_train[col].value_counts().head())


**Observations**

- eco_category has the same value for all rows. Drop column.
- opc_scheme has 3 unique values. A non-null value seems to indicate that the car is under the OPC scheme. Convert to binary.
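
Columns that carry a single value, as observed for eco_category, can be found programmatically instead of by inspection. A minimal sketch follows; the helper name `constant_columns` and the toy values are hypothetical.

```python
import pandas as pd

def constant_columns(df):
    """Return the columns that hold at most one distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "eco_category": ["uncategorized"] * 4,
    "transmission": ["auto", "manual", "auto", "auto"],
})

constant = constant_columns(toy)
print(constant)
```

Such columns have zero variance and cannot help a regressor, so dropping them loses no information.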


#### Categorical Analysis (with added binary features)

In [None]:
print(df_train.columns)
print(categorical_cols)
print(binary_columns)

In [None]:
categorical_and_binary_cols = list(categorical_cols) + binary_columns
categorical_and_binary_cols

In [None]:
MAX_UNIQUE_VALUES = 30
for feature in categorical_and_binary_cols:
    if len(df_train[feature].value_counts()) < MAX_UNIQUE_VALUES:
        plt.figure(figsize=(8, 4))
        sns.countplot(x=feature, data=df_train, order=df_train[feature].value_counts().index, palette="Set3")
        plt.title(f'Distribution of {feature}')
        plt.tight_layout()
        plt.savefig(f'visualisations/distribution_categorical_{feature}.png')
        plt.show()

        
        # Relationship with price
        plt.figure(figsize=(8, 4))
        sns.boxplot(x=feature, y='price', data=df_train, palette="Set3")
        plt.title(f'{feature} vs Price')
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.savefig(f'visualisations/price_vs_{feature}.png')
        plt.show()



**Observations**
- Type of vehicle:
    - SUVs, luxury sedans, and sports cars are the most common vehicle types in the dataset.
    - SUVs are highly represented, with relatively lower price variability compared to luxury sedans and sports cars.
    - Truck, station wagon, and bus/mini bus have lower counts, indicating that these vehicle types are less represented in this dataset, possibly niche categories. Generally lower median prices.
    - Luxury sedans and sports cars show the highest resale prices, as expected given the premium nature of these vehicles. There are several outliers in these categories, indicating that some cars are priced significantly higher than the rest.
    - SUVs have a relatively wide price range, with a number of outliers in the upper range, possibly high-end or luxury SUVs. 
    
- Transmission type:
    - The majority of vehicles in the dataset have automatic transmission.
    - Manual transmission vehicles tend to have a lower price range overall compared to automatics. However, there are still some outliers with higher prices, which may correspond to specific models that are rarer or in high demand among enthusiasts (e.g., sports cars with manual transmission).

- Fuel type:
    - The most common fuel type is diesel, followed by petrol-electric and petrol. This suggests a significant preference for diesel vehicles in the dataset.
    - Electric vehicles are less common, while diesel-electric vehicles have the lowest count among the listed fuel types.
    - Petrol-electric vehicles also show a significant range in pricing, with some outliers that could represent premium models. This suggests that hybrid vehicles are valued well in the resale market.
    - Electric vehicles have a narrower price range compared to diesel and petrol-electric, suggesting that the market for used electric vehicles may not be as established yet, potentially affecting their resale value.
    - Note: Many missing values for this feature
- opc_scheme:
    - Since the feature is highly imbalanced, the box plot is not very informative.
- parf:
    - parf cars have a slightly higher price range compared to coe cars.
- rare:
    - Rare cars can command much higher resale prices.
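
The visual observations above can be backed by a numeric summary per category. This is a sketch under assumptions: `price_summary_by` is a hypothetical helper and the toy prices are invented. In the notebook it would be called as `price_summary_by(df_train, 'type_of_vehicle')`.

```python
import pandas as pd

def price_summary_by(df, feature, target="price"):
    """Count, median and interquartile range of `target` for each level of `feature`."""
    grouped = df.groupby(feature)[target]
    summary = grouped.agg(count="count", median="median")
    summary["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return summary.sort_values("median", ascending=False)

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "transmission": ["auto", "auto", "auto", "manual", "manual"],
    "price": [100, 120, 140, 50, 70],
})

summary = price_summary_by(toy, "transmission")
print(summary)
```

The count column matters as much as the median here: levels with very few listings (for example a heavily imbalanced binary flag) give unstable medians and ranges.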

In [281]:
date_features = ['original_reg_date','reg_date', 'lifespan', 'coe_end']

In [None]:
print("\n=== Date Fields Analysis ===")
for col in date_features:

    print(f"\n{col}:")
    print(f"Unique values sample: {df_train[col].unique()[:5]}")
    # Check for invalid dates
    try:
        df_train[col] = pd.to_datetime(df_train[col])
        print("Min date:", df_train[col].min())
        print("Max date:", df_train[col].max())
    except (ValueError, TypeError):
        print("Error converting to datetime - possible invalid date formats")


In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(df_train['price'])
plt.title('Boxplot of Price to detect outliers')
plt.show()



In [None]:
df_train[df_train['price'] == df_train['price'].max()]

In [None]:
# df_train.groupby(['make','model'])['price'].describe().sort_values(by='count', ascending=False).to_csv('price_model_make.csv')

df_price_breakdown = df_train.groupby(['make','model'])['price'].describe()
df_price_breakdown

# Handle Missing Values Numerical

In [None]:
! pip install category_encoders

In [287]:
import category_encoders as ce
from sklearn.impute import KNNImputer

In [288]:

def ENCODE_transmission(df):
    
    df['TRANSMISSION-manual'] = pd.get_dummies(df['transmission'], drop_first=True, dtype=int)
    df                        = df.drop(columns = ['transmission'])
    
    return (df)

def ENCODE_vehtype(df):
    
    encoder             = ce.BinaryEncoder(cols='type_of_vehicle',return_df=True)
    df_temp             = encoder.fit_transform(df['type_of_vehicle']) 
    df_temp.columns     = [col.replace('type_of_vehicle_', "TYPE-binenc") for col in df_temp.columns] 

    df                  = pd.concat([df, df_temp], axis = 1)
    df                  = df.drop(columns = 'type_of_vehicle')

    return df

def ENCODE_make(df):

    encoder             = ce.BinaryEncoder(cols='make',return_df=True)
    df_temp             = encoder.fit_transform(df['make']) 
    df_temp.columns     = [col.replace('make_', "MAKE-binenc") for col in df_temp.columns] 

    df = pd.concat([df, df_temp], axis =1)
    df = df.drop(columns = ['make'])
    
    return(df)

def ENCODE_fueltype(df):
    
    encoder         = ce.BinaryEncoder(cols='fuel_type',return_df=True)
    df_temp         = encoder.fit_transform(df['fuel_type']) 
    df_temp.columns = [col.replace('fuel_type_', "FUEL-binenc") for col in df_temp.columns] 

    df = pd.concat([df, df_temp], axis = 1)
    df = df.drop(columns = 'fuel_type')

    return df

def ENCODE_model(df):

    df['model']     = df['model'].str.upper()
    
    encoder         = ce.BinaryEncoder(cols='model',return_df=True)
    df_temp         = encoder.fit_transform(df['model']) 
    df_temp.columns = [col.replace('model_', "MODEL-binenc") for col in df_temp.columns] 

    df = pd.concat([df, df_temp], axis = 1)
    df = df.drop(columns = 'model')

    return df

In [289]:
drop_cols = ['title', 'description', 'original_reg_date', 'category','features', 'accessories', 'fuel_type_category_fill', 'fuel_type_model_make_fill',
       'coe_end', 'lifespan', 'eco_category']

In [290]:
df_encoded = df_train.drop(columns=drop_cols).copy()
df_encoded = ENCODE_transmission(df_encoded)
df_encoded = ENCODE_vehtype(df_encoded)
df_encoded = ENCODE_make(df_encoded)
df_encoded = ENCODE_model(df_encoded)
df_encoded = ENCODE_fueltype(df_encoded)

In [None]:
df_encoded.columns

In [None]:
df_encoded.isna().sum()

In [294]:
GENERAL_CAR_RELATED_COLS = ['TRANSMISSION-manual',
                     'TYPE-binenc0',
                     'TYPE-binenc1',
                     'TYPE-binenc2',
                     'TYPE-binenc3',
                     'MAKE-binenc0',
                     'MAKE-binenc1',
                     'MAKE-binenc2',
                     'MAKE-binenc3',
                     'MAKE-binenc4',
                     'MAKE-binenc5',
                     'MAKE-binenc6',
                     'MODEL-binenc0',
                     'MODEL-binenc1',
                     'MODEL-binenc2',
                     'MODEL-binenc3',
                     'MODEL-binenc4',
                     'MODEL-binenc5',
                     'MODEL-binenc6',
                     'MODEL-binenc7',
                     'MODEL-binenc8',
                     'MODEL-binenc9',]

GENERAL_CAR_AND_AGE_RELATED_COLS = GENERAL_CAR_RELATED_COLS + ['AGE-current',
                     'AGE-remaining'
                    ]


def _knn_impute_column(df, target, related_cols):
    """Fill missing values of `target` with a KNN estimate based on `related_cols`."""
    imputer = KNNImputer(n_neighbors=5, weights='distance')
    imputed = imputer.fit_transform(df[related_cols + [target]])
    # The target is the last column; reuse df's index so fillna aligns row by row
    imputed_target = pd.Series(imputed[:, -1], index=df.index)
    df[target] = df[target].fillna(imputed_target)
    return df

def IMPUTENULL_power(df):
    return _knn_impute_column(df, 'power', GENERAL_CAR_RELATED_COLS)

def IMPUTENULL_curbweight(df):
    return _knn_impute_column(df, 'curb_weight', GENERAL_CAR_RELATED_COLS)

def IMPUTENULL_enginecap(df):
    return _knn_impute_column(df, 'engine_cap', GENERAL_CAR_RELATED_COLS)

def IMPUTENULL_depreciation(df):
    return _knn_impute_column(df, 'depreciation', GENERAL_CAR_AND_AGE_RELATED_COLS)

def IMPUTENULL_arf(df):
    return _knn_impute_column(df, 'arf', GENERAL_CAR_AND_AGE_RELATED_COLS)

def IMPUTENULL_omv(df):
    return _knn_impute_column(df, 'omv', GENERAL_CAR_AND_AGE_RELATED_COLS)

def IMPUTENULL_mileage(df):
    # Mileage is assumed to depend on usage time rather than on make or model
    return _knn_impute_column(df, 'mileage', ['TRANSMISSION-manual', 'AGE-current', 'AGE-remaining'])

def IMPUTENULL_roadtax(df):
    return _knn_impute_column(df, 'road_tax', GENERAL_CAR_AND_AGE_RELATED_COLS)

def IMPUTENULL_deregvalue(df):
    return _knn_impute_column(df, 'dereg_value', ['depreciation', 'AGE-remaining', 'omv', 'arf', 'rare'])


df_encoded = IMPUTENULL_power(df=df_encoded)
df_encoded = IMPUTENULL_curbweight(df=df_encoded)
df_encoded = IMPUTENULL_enginecap(df=df_encoded)
df_encoded = IMPUTENULL_depreciation(df=df_encoded)
df_encoded = IMPUTENULL_omv(df=df_encoded)
df_encoded = IMPUTENULL_arf(df=df_encoded)
df_encoded = IMPUTENULL_mileage(df=df_encoded)
df_encoded = IMPUTENULL_roadtax(df=df_encoded)
df_encoded = IMPUTENULL_deregvalue(df=df_encoded)
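
The per-column imputation functions above all follow the same pattern, so they could be collapsed into one helper. The sketch below is a minimal illustration under assumptions: `impute_column_knn` and the toy frame are hypothetical names that do not appear elsewhere in this notebook, and the helper returns a copy instead of modifying `df` in place. It also shows why the imputed values are re-attached to the original index before calling `fillna`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_column_knn(df, target, related_cols, n_neighbors=5):
    """Fill NaNs in `target` via KNN over `related_cols` (hypothetical helper)."""
    imputer = KNNImputer(n_neighbors=n_neighbors, weights='distance')
    imputed = imputer.fit_transform(df[related_cols + [target]])
    # The target is the last column; keep df's index so fillna aligns row by row
    filled = pd.Series(imputed[:, -1], index=df.index)
    out = df.copy()
    out[target] = out[target].fillna(filled)
    return out


# Toy frame with a non-default index and one missing value (made-up numbers)
toy = pd.DataFrame(
    {'age': [1.0, 2.0, 3.0, 4.0], 'omv': [40.0, 30.0, np.nan, 10.0]},
    index=[10, 11, 12, 13],
)
toy_filled = impute_column_knn(toy, target='omv', related_cols=['age'], n_neighbors=2)
print(toy_filled)
```

With a helper like this, each column would need only a single call with its own list of related columns.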


In [None]:
corr

In [None]:
df_encoded.isna().sum()

In [133]:
df_encoded['dereg_value_TEMP'] = df_encoded['dereg_value'].fillna(df_encoded['depreciation'] * df_encoded['AGE-remaining'])

# Outlier Detection

In [298]:
df_encoded = df_encoded.drop(columns=['reg_date'])

In [None]:
df_encoded.isna().sum()

In [300]:
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_encoded)

In [None]:
from sklearn.neighbors import NearestNeighbors

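# k-distance graph on the scaled features, used to pick eps for DBSCAN.
# Querying the fitted data returns each point as its own nearest neighbour
# (distance 0), so column 4 holds the distance to the 4th other point.
neighbors = NearestNeighbors(n_neighbors=5)
neighbors_fit = neighbors.fit(data_scaled)
distances, indices = neighbors_fit.kneighbors(data_scaled)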

distances = np.sort(distances[:, 4], axis=0)
plt.plot(distances)
plt.ylabel('k-distance')
plt.show()
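
One detail of the k-distance computation is easy to miss: when `kneighbors` is given the same data the model was fitted on, every point is returned as its own first neighbour. The sketch below checks this on synthetic data (the random matrix is an assumption standing in for `data_scaled`) by comparing against `kneighbors()` called without a query, which excludes the point itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic stand-in for the scaled features

nn = NearestNeighbors(n_neighbors=5).fit(X)
d_with_self, _ = nn.kneighbors(X)    # query is the training data: column 0 is the point itself
d_without_self, _ = nn.kneighbors()  # no query: the point itself is excluded

# Column 4 of the first result equals column 3 of the second:
# both are the distance to the 4th other point.
print(d_with_self[:3, 4])
print(d_without_self[:3, 3])
```

Since DBSCAN's `min_samples` counts the point itself, the distance to the 4th other point is the quantity that corresponds to `min_samples=5`.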

In [None]:
from sklearn.decomposition import PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded)

pca = PCA()
pca.fit(X_scaled)  # fit() returns the estimator itself, not transformed data

explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
print(cumulative_variance)

n_components_90 = np.argmax(cumulative_variance >= 0.9) + 1  # +1 because index is 0-based
print(f"Number of components to retain 90% variance: {n_components_90}")
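
As a cross-check on the manual component count, scikit-learn's `PCA` also accepts a fraction as `n_components` (with `svd_solver='full'`) and picks the number of components needed to explain that share of variance. The sketch below compares both approaches on synthetic data; the random matrix is an assumption and is not the notebook's `X_scaled`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 10 features with decreasing variance
X = rng.normal(size=(300, 10)) * np.arange(10, 0, -1)

# Manual route: fit all components, then find the first index reaching 90%
full = PCA(svd_solver='full').fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.9) + 1)

# Built-in route: ask PCA for 90% explained variance directly
auto = PCA(n_components=0.9, svd_solver='full').fit(X)

print(n_manual, auto.n_components_)
```

The two counts can differ only when the cumulative variance lands exactly on the threshold, which is unlikely with real-valued data.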


In [None]:
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance) + 1), cumulative_variance, marker='o')
plt.title('Cumulative Explained Variance by Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.axhline(y=0.90, color='r', linestyle='--')  # Line at 90%
plt.axvline(x=n_components_90, color='g', linestyle='--')  # Line indicating number of components
plt.grid()
plt.show()

In [None]:
pca = PCA(n_components=n_components_90)  # Use the number of components determined earlier
pca_data = pca.fit_transform(X_scaled)

# Determine the `eps` parameter from a k-distance graph on the PCA-reduced data
# Use NearestNeighbors to find the k-nearest distances
k = 5  # Common choice for DBSCAN; can adjust based on data characteristics
neighbors = NearestNeighbors(n_neighbors=k)
neighbors.fit(pca_data)
distances, indices = neighbors.kneighbors(pca_data)
# Each point is returned as its own first neighbour (distance 0), so column
# k-1 is the distance to the (k-1)-th other point
k_distances = np.sort(distances[:, k-1])

# Plot k-distance graph
plt.figure(figsize=(10, 6))
plt.plot(k_distances)
plt.title('K-Distance Graph')
plt.xlabel('Data Points sorted by Distance to their {}-th Nearest Neighbor'.format(k))
plt.ylabel('Distance')
plt.grid()
plt.show()

In [None]:
k_distances

In [307]:
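# Initialize DBSCAN
db = DBSCAN(eps=5, min_samples=5)

# Fit the model on the full scaled feature matrix (X_scaled), not on pca_data,
# so eps should be judged against the k-distance graph of the scaled data
db.fit(X_scaled)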

# Get the cluster labels
labels = db.labels_

# Add labels to the original data
df_encoded['cluster'] = labels

# Identify anomalies (labeled as -1 by DBSCAN)
anomalies = df_encoded[df_encoded['cluster'] == -1]

In [None]:
len(anomalies)
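
For reference, the sketch below shows on synthetic data how DBSCAN marks isolated points with the label `-1`. The data, `eps` and `min_samples` values here are assumptions chosen only to make the behaviour visible; they are not tuned for the car dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(50, 2))  # one tight cluster
far = np.array([[10.0, 10.0], [-10.0, 5.0]])          # two isolated points
X = np.vstack([dense, far])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points with fewer than min_samples neighbours within eps (counting themselves)
# that are not within eps of any core point are labelled -1 (noise)
is_noise = labels == -1
print(labels[-2:], is_noise.sum())
```

The same `labels == -1` mask is what the anomaly filter in this notebook relies on.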

In [None]:
df_train['Cluster'] = labels
df_train['Anomaly'] = (labels == -1).astype(int)

anomalies_df = df_train[df_train['Cluster']==-1]
anomalies_df.head()


In [None]:
df_train['Cluster'] = labels
anomalies_df = df_train[(df_train['Cluster']==-1) 
                        & (df_train['rare']==0)]
print(len(anomalies_df))
anomalies_df.head()


In [None]:
# sns.pairplot creates its own figure, so no plt.figure call is needed;
# the title is set on that figure, not on a single subplot
g = sns.pairplot(df_train[df_train['rare']==0], vars=strong_corr_price + ['price'], hue='Anomaly')
g.fig.suptitle('Features with strong corr with Price Pairplot (by Anomaly)', y=1.02)
# plt.savefig("visualisations/pairplot_strong_corr_by_parf.png")
plt.show()

In [None]:
# sns.pairplot creates its own figure, so no plt.figure call is needed;
# the title is set on that figure, not on a single subplot
g = sns.pairplot(df_train[df_train['rare']==0], vars=moderate_corr_price+['price'], hue='Anomaly')
g.fig.suptitle('Features with moderate corr with Price Pairplot (by Anomaly)', y=1.02)
# plt.savefig("visualisations/pairplot_strong_corr_by_rare.png")
plt.show()

In [None]:
df_train['reg_date'].describe()

In [None]:
df_train[df_train['reg_date']=='1959-05-06']

In [None]:
df_train[df_train['AGE-remaining']<0]

In [None]:
neg_age_remaning = df_train[df_train['AGE-remaining']<0]
neg_age_remaning.head()