# CS5228-2410 Final Project EDA

## Overview

*(from [Kaggle Project](https://www.kaggle.com/competitions/cs-5228-2410-final-project))*

In this project, we look into the market for used cars in Singapore. Car ownership in Singapore is expensive, in part because prices for new and used cars are very high compared to many other countries. There are many stakeholders in this market. Buyers and sellers want to find good prices, so they need to understand what affects the value of a car. Online platforms facilitating the sale of used cars, on the other hand, want to maximize the number of sales/transactions.

The goal of this task is to predict the resale price of a car based on its properties (e.g., make, model, mileage, age, power, etc). It is therefore first and foremost a regression task. These different types of information allow you to come up with features for training a regressor. It is part of the project for you to justify, derive and evaluate different features. Besides predicting the outcome in terms of a dollar value, other useful results include the importance of different attributes, the evaluation and comparison of different regression techniques, an error analysis and discussion about limitations and potential extensions, etc.

## Main Steps
1. Load dataset
2. Initial inspection
    - Basic information
    - Summary stats
3. Data quality check
    - Missing values
    - Identify duplicates
    - Examine data types
4. EDA
    - Univariate analysis
        - Numerical: histograms, boxplots
        - Categorical: barplots, pie charts
    - Bivariate analysis
        - Scatterplots: price v numerical
        - Boxplots: price v categorical
    - Correlation analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from wordcloud import WordCloud
import difflib

In [2]:
pd.set_option('display.max_columns', None)

## 1    Load data

In [3]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

## 2    Initial Inspection

In [None]:
print("# of records: ",len(df_train))
print("# of columns: ",len(df_train.columns))

In [None]:
df_train.info()

## 3    Data Quality Check

In [None]:
# Calculate missing values
def get_missing_info(df_train):
    missing_values = df_train.isnull().sum()
    missing_percentages = (missing_values / len(df_train) * 100).round(2)

    missing_info = pd.DataFrame({
        'missing_count': missing_values,
        'missing_perc': missing_percentages
    }).sort_values(by="missing_count", ascending=False)

    return missing_info

missing_info = get_missing_info(df_train=df_train)
print("\n=== missing_count Analysis ===")
print(missing_info[missing_info['missing_count'] > 0].sort_values('missing_perc', ascending=False))


The dataset has significant missing data across several features:

- 'indicative_price' is completely missing (100%)
- 'opc_scheme' and 'original_reg_date' are missing for nearly all records (>98%)
- 'lifespan' is missing for about 90% of the records
- 'fuel_type' is missing for about 76% of the records
- 'mileage' is missing for about 21% of the records
- Several other fields have missing data ranging from 0.03% to 15.25%


In [None]:
print("\n=== Duplicate Records Analysis ===")
print(f"Number of duplicate rows: {df_train.duplicated().sum()}")
print(f"Number of duplicate listing_ids: {df_train['listing_id'].duplicated().sum()}")


There are no duplicate rows or listing_ids in the dataset.

In [None]:
df_train.sample(3)

**Observations**
- Since `indicative_price` is completely missing, drop the column.
- `listing_id` is an identifier and carries no information for the analysis.
- Since `original_reg_date` is almost entirely missing, the feature carries little information. The latest `reg_date` is likely more useful because it is closely tied to the COE price at the time of registration, and COE has a heavy influence on car resale price.
- `fuel_type` has many missing values but can potentially be obtained from `category`.
- `lifespan` has many missing values but can potentially be inferred from the `title`.
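
The drop decisions above can be sketched as a small helper. This is a minimal illustration on a toy frame, not the notebook's actual cleaning step: the function name `drop_sparse_columns` and the 0.95 threshold are assumptions. A threshold this strict would also drop `opc_scheme` and `original_reg_date`, so it should only run after any binary flags have been derived from them.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_frac=0.95, always_drop=()):
    """Return (copy of df without sparse columns, list of dropped column names).

    A column is dropped when its fraction of missing values exceeds
    max_missing_frac, or when it is named in always_drop.
    Hypothetical helper; the threshold is an assumption.
    """
    missing_frac = df.isna().mean()
    sparse = missing_frac[missing_frac > max_missing_frac].index.tolist()
    to_drop = sorted(set(sparse) | (set(always_drop) & set(df.columns)))
    return df.drop(columns=to_drop), to_drop

# Toy frame standing in for df_train (values are made up)
toy = pd.DataFrame({
    "listing_id": [1, 2, 3, 4],
    "indicative_price": [np.nan] * 4,
    "mileage": [10000.0, np.nan, 52000.0, 8000.0],
    "price": [50000, 72000, 31000, 98000],
})
cleaned, dropped = drop_sparse_columns(toy, always_drop=["listing_id"])
print(dropped)
print(list(cleaned.columns))
```

Returning the list of dropped names makes it easy to apply the same drop to `df_test` so both frames keep identical columns.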

### Textual Features Extraction

In [None]:
categorical_columns = df_train.select_dtypes(include=["object"]).columns
categorical_columns

In [10]:
text_features = ['title', 'description', 'category', 'features', 'accessories', 'opc_scheme', 'eco_category']

In [None]:
# Create a word cloud for each text feature
import os
os.makedirs('visualisations', exist_ok=True)  # savefig fails if the output folder is missing

for feature in text_features:
    # Cast to str so non-string entries do not break the join
    text_data = ' '.join(df_train[feature].dropna().astype(str).tolist())
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text_data)

    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Wordcloud – {feature}')
    plt.savefig(f'visualisations/wordcloud_{feature}.png')
    plt.show()

### Data Cleaning & Data Transformation

In [12]:
# Create binary variables from text features

# Convert opc scheme to binary for further analysis
df_train['opc_scheme'] = df_train['opc_scheme'].notna().astype(int)
# Create column for parf v coe cars (na=False keeps missing categories as 0 instead of raising)
df_train['parf'] = df_train['category'].str.contains('parf', regex=False, na=False).astype(int)
# Create column for rare & exotic cars
df_train['rare'] = df_train['category'].str.contains('rare & exotic', regex=False, na=False).astype(int)
# Create column for vintage cars
df_train['vintage'] = df_train['category'].str.contains('vintage', regex=False, na=False).astype(int)

In [None]:
binary_columns = ['opc_scheme', 'parf', 'rare', 'vintage']
for col in binary_columns:
    print(f"\n{col}:")
    print(df_train[col].value_counts().head())

In [None]:
print("\n=== Fix Missing Values (make) ===")
print("Missing values: ",df_train['make'].isna().sum())


df_train['make']          = df_train['make'].str.upper()
df_train['make']          = df_train['make'].str.replace(' ','').str.strip()
df_train['title']         = df_train['title'].str.upper()

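make_list                   = sorted(make for make in df_train['make'].unique() if isinstance(make, str))
print("Unique values: ")
print(make_list[:20])
print()

def closest_make(token):
    # Fall back to the raw token when difflib finds no close match (indexing [0] on an empty list raises IndexError)
    if not isinstance(token, str):
        return token
    matches = difflib.get_close_matches(token, make_list, n=1)
    return matches[0] if matches else token

df_train['make_temp']     = df_train['title'].str.split(' ').str[0]
df_train['make_temp_similar']     = df_train['make_temp'].apply(closest_make)

# Fill with the matched make so filled values follow the same spelling as the existing ones
df_train['make']          = df_train['make'].fillna(df_train['make_temp_similar'])
df_train                  = df_train.drop(columns = ['make_temp', 'make_temp_similar'])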
print("Missing values (after cleaning): ", df_train['make'].isna().sum())

In [None]:
print("\n=== Fix Missing Values (manufactured) ===")
print("Missing values: ",df_train['manufactured'].isna().sum())

df_train['original_reg_date']         = pd.to_datetime(df_train['original_reg_date'], format = "%d-%b-%Y")
df_train['reg_date']                  = pd.to_datetime(df_train['reg_date'], format = "%d-%b-%Y")

df_train['original_reg_date_temp']    = df_train['original_reg_date'].dt.year
df_train['reg_date_temp']             = df_train['reg_date'].dt.year
df_train['manufactured']              = df_train['manufactured'].fillna(df_train[['original_reg_date_temp','reg_date_temp']].min(axis=1))
df_train['manufactured']              = df_train['manufactured'].astype(int).astype(str)
df_train                              = df_train.drop(columns = ['original_reg_date_temp', 'reg_date_temp'])
print("Missing values (after cleaning): " ,df_train['manufactured'].isna().sum())


In [None]:
print("\n=== Fix Missing Values (no_of_owners) ===")
print("Missing values: ",df_train['no_of_owners'].isna().sum())
print("\nSummary statistics: ", df_train['no_of_owners'].describe())
df_train['no_of_owners'] = df_train['no_of_owners'].fillna(df_train['no_of_owners'].median())
print("\nMissing values (after cleaning): " ,df_train['no_of_owners'].isna().sum())

In [None]:
# Extract fuel type from category
fuel_keywords = {
    'electric': 'electric',
    'hybrid': 'petrol-electric'
}

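def extract_fuel_type(category_text):
    # Missing categories cannot be searched for keywords
    if not isinstance(category_text, str):
        return None
    category_text = category_text.lower()
    for keyword, fuel_type in fuel_keywords.items():
        if keyword in category_text:
            return fuel_type
    return None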

# Apply the function to the rows where fuel_type is missing
df_train['fuel_type_category_fill'] = df_train['fuel_type']
print(f"Number of missing values for fuel_type (before): {df_train['fuel_type_category_fill'].isna().sum()}")
df_train.loc[df_train['fuel_type'].isna(), 'fuel_type_category_fill'] = df_train['category'].apply(extract_fuel_type)
print(f"Number of missing values for fuel_type (after): {df_train['fuel_type_category_fill'].isna().sum()}")


In [None]:
# Create mapping for fuel type from model make
fuel_type_mapping = df_train.groupby(['make', 'model'])['fuel_type'].agg(lambda x: x.mode()[0] if not x.mode().empty else None).reset_index()
fuel_type_dict = dict(zip(zip(fuel_type_mapping['make'], fuel_type_mapping['model']), fuel_type_mapping['fuel_type']))

# Define a function to fill missing fuel types
def fill_fuel_type(row, fuel_type_dict):
    if pd.isna(row['fuel_type']):
        return fuel_type_dict.get((row['make'], row['model']), None)
    return row['fuel_type']

# Apply the function to fill in missing values
df_train['fuel_type_model_make_fill'] = df_train['fuel_type']
print(f"Number of missing values for fuel_type (before): {df_train['fuel_type_model_make_fill'].isna().sum()}")
df_train.loc[df_train['fuel_type'].isna(), 'fuel_type_model_make_fill']  = df_train.apply(fill_fuel_type, axis=1, fuel_type_dict=fuel_type_dict)
print(f"Number of missing values for fuel_type (after): {df_train['fuel_type_model_make_fill'].isna().sum()}")

df_train['fuel_type'] = df_train['fuel_type_model_make_fill']


In [48]:
# Extract coe end date from title
df_train['coe_end'] = df_train['title'].str.extract(r'\(COE TILL (\d{2}/\d{4})\)')
df_train['coe_end'] = pd.to_datetime(df_train['coe_end'], format='%m/%Y')

In [None]:
missing_info = get_missing_info(df_train=df_train)
missing_info

## 4    EDA

### Univariate Analysis

#### Numerical Analysis

In [None]:
# Identify numerical columns
numerical_cols = df_train.select_dtypes(include=['int64', 'float64']).columns.drop(['listing_id', 'indicative_price'])

print("\n=== Numerical Features Analysis ===")
print(df_train[numerical_cols].describe())

# Check for outliers using IQR method
print("\n=== Outlier Analysis (IQR Method) ===")
for col in numerical_cols:
    Q1 = df_train[col].quantile(0.25)
    Q3 = df_train[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df_train[(df_train[col] < (Q1 - 1.5 * IQR)) | (df_train[col] > (Q3 + 1.5 * IQR))][col]
    if len(outliers) > 0:
        print(f"\n{col}:")
        print(f"Number of outliers: {len(outliers)}")
        print(f"Percentage of outliers: {(len(outliers)/len(df_train)*100):.2f}%")

fig, axes = plt.subplots(len(numerical_cols), 2, figsize=(10, 2*len(numerical_cols)))

for i, feature in enumerate(numerical_cols):
    
    # Histogram
    sns.histplot(df_train[feature].dropna(), kde=True, ax=axes[i, 0])
    axes[i, 0].set_title(f'Histogram of {feature}')
    axes[i, 0].set_xlabel(feature)
    
    # Box plot
    sns.boxplot(x=df_train[feature].dropna(), ax=axes[i, 1])
    axes[i, 1].set_title(f'Box Plot of {feature}')
    axes[i, 1].set_xlabel(feature)

plt.tight_layout()
plt.savefig("visualisations/distribution_numerical.png", bbox_inches='tight')
plt.show()


In [None]:
# Plot correlation matrix of numerical features
corr = df_train[numerical_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Matrix of Numerical Features')
plt.savefig("visualisations/correlation_matrix.png")
plt.show()

**Strongest correlations with price**
- dereg_value (0.91)
- arf (0.89)
- omv (0.82)
- depreciation (0.81)
- power (0.70)

Focus on these features for initial modelling effort.


**Moderate correlations with price**
- road_tax (0.52)
- engine_cap (0.44)
- coe (0.35)
- mileage (-0.39)
- rare (0.60)

**Low correlation with price**
- manufactured (0.20)
- curb_weight (0.15)
- no_of_owners (-0.08)
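
The grouping above was read off the heatmap by eye. A minimal sketch of deriving the same buckets from the correlation matrix follows. The helper name `bucket_by_price_corr`, the thresholds (0.7 and 0.3), and the toy values are all assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

def bucket_by_price_corr(df, target="price", strong=0.7, moderate=0.3):
    """Group numeric features by |Pearson correlation| with the target.

    Hypothetical helper; the two thresholds are assumptions.
    """
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    buckets = {"strong": [], "moderate": [], "low": []}
    # Walk features from the largest to the smallest absolute correlation
    for feature in corr.abs().sort_values(ascending=False).index:
        r = corr[feature]
        if pd.isna(r):  # constant columns have undefined correlation
            continue
        if abs(r) >= strong:
            buckets["strong"].append(feature)
        elif abs(r) >= moderate:
            buckets["moderate"].append(feature)
        else:
            buckets["low"].append(feature)
    return buckets

# Toy frame with made-up values
toy = pd.DataFrame({
    "price":        [1, 2, 3, 4, 5, 6],
    "omv_like":     [2, 4, 6, 8, 10, 12],   # perfectly correlated with price
    "mileage_like": [1, 0, 1, 0, 0, 0],     # moderately negative
    "flag_like":    [1, 0, 0, 0, 0, 1],     # uncorrelated
})
buckets = bucket_by_price_corr(toy)
print(buckets)
```

Using the absolute value keeps negatively correlated features such as `mileage` in the bucket that matches their strength.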

**Strong correlations between features**

- omv and arf (0.94)
- engine_cap and road_tax (0.94)
- power and engine_cap (0.86)

These pairs indicate potential multicollinearity, which can destabilise coefficient estimates in linear models (tree-based models are less sensitive). Options are to keep only the most relevant feature from each pair or to apply a dimensionality reduction method such as PCA.
Feature engineering could also add interaction terms between strongly related features (e.g. power * engine_cap).
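
One way to find candidate pairs without reading the heatmap is to scan the upper triangle of the correlation matrix. The sketch below assumes a cut-off of 0.85 and a helper name `correlated_pairs`, both chosen for illustration. The toy values are made up.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.85, exclude=("price",)):
    """Return feature pairs whose |Pearson correlation| is at least threshold.

    Hypothetical helper; the threshold is an assumption. The target is
    excluded because only feature-to-feature correlation matters here.
    """
    numeric = df.select_dtypes(include="number").drop(columns=list(exclude), errors="ignore")
    corr = numeric.corr().abs()
    # Keep the strict upper triangle so each pair is listed once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Toy frame with made-up values
toy = pd.DataFrame({
    "engine_cap":   [1000, 1200, 1600, 2000, 2400, 3000],
    "road_tax":     [600, 700, 900, 1100, 1300, 1600],   # linear in engine_cap
    "no_of_owners": [2, 1, 1, 2, 1, 1],
    "price":        [30000, 40000, 60000, 80000, 100000, 150000],
})
pairs = correlated_pairs(toy)
print(pairs)
```

For each reported pair, one feature can be dropped or the two can be combined before fitting a linear model.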






In [71]:
strong_corr_price = ['dereg_value', 'arf', 'omv', 'depreciation', 'power']
moderate_corr_price = ['road_tax', 'engine_cap', 'coe', 'mileage', 'rare']

In [None]:
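# Note: categorical_and_binary_cols is defined in the "Categorical Analysis (with added binary features)" section below; run that cell first
categorical_and_binary_cols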

In [None]:
# pairplot builds its own figure, so the title and save go through the returned grid
g = sns.pairplot(df_train, vars=strong_corr_price + ['price'], hue='parf')
g.figure.suptitle('Features with strong corr with Price Pairplot (by parf)', y=1.02)
g.savefig("visualisations/pairplot_strong_corr_by_parf.png")
plt.show()

In [None]:
# pairplot builds its own figure, so the title and save go through the returned grid
g = sns.pairplot(df_train, vars=strong_corr_price + ['price'], hue='rare')
g.figure.suptitle('Features with strong corr with Price Pairplot (by rare)', y=1.02)
g.savefig("visualisations/pairplot_strong_corr_by_rare.png")
plt.show()

In [None]:
# pairplot builds its own figure, so the title and save go through the returned grid
g = sns.pairplot(df_train, vars=moderate_corr_price + ['price'], hue='parf')
g.figure.suptitle('Features with moderate corr with Price Pairplot (by parf)', y=1.02)
g.savefig("visualisations/pairplot_mod_corr_by_parf.png")
plt.show()

In [None]:
# pairplot builds its own figure, so the title and save go through the returned grid
g = sns.pairplot(df_train, vars=moderate_corr_price + ['price'], hue='rare')
g.figure.suptitle('Features with moderate corr with Price Pairplot (by rare)', y=1.02)
g.savefig("visualisations/pairplot_mod_corr_by_rare.png")
plt.show()

#### Categorical Analysis

In [None]:
# Identify categorical columns
categorical_cols = df_train.select_dtypes(include=['object']).columns

print("\n=== Categorical Features Analysis ===")
for col in categorical_cols:
    unique_values = df_train[col].nunique()
    print(f"\n{col}:")
    print(f"Number of unique values: {unique_values}")
    if unique_values < 10:  # Only show value counts for columns with few unique values
        print(df_train[col].value_counts().head())


**Observations**

- `eco_category` has the same value for all rows. Drop the column.
- `opc_scheme` has 3 unique values. A non-missing value seems to indicate that the car is under the OPC scheme, so convert it to binary.
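
Columns with a single value, like `eco_category` here, can be detected programmatically instead of by reading the printed counts. A minimal sketch, assuming a helper name `constant_columns` and a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

def constant_columns(df):
    """Return columns that hold at most one distinct non-missing value.

    Hypothetical helper. Fully missing columns count as constant too,
    since nunique() ignores NaN and reports 0 for them.
    """
    nunique = df.nunique(dropna=True)
    return nunique[nunique <= 1].index.tolist()

# Toy frame; the eco_category value is a placeholder, not the real one
toy = pd.DataFrame({
    "eco_category": ["placeholder"] * 4,
    "transmission": ["auto", "manual", "auto", "auto"],
    "indicative_price": [np.nan] * 4,
})
to_drop = constant_columns(toy)
print(to_drop)
```

A constant column has zero variance, so it cannot help any regressor separate cheap cars from expensive ones.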


#### Categorical Analysis (with added binary features)

In [None]:
print(df_train.columns)
print(categorical_cols)
print(binary_columns)

In [None]:
categorical_and_binary_cols = list(categorical_cols) + binary_columns
categorical_and_binary_cols

In [None]:
MAX_UNIQUE_VALUES = 30
for feature in categorical_and_binary_cols:
    if len(df_train[feature].value_counts()) < MAX_UNIQUE_VALUES:
        plt.figure(figsize=(8, 4))
        sns.countplot(y=feature, data=df_train, order=df_train[feature].value_counts().index)
        plt.title(f'Distribution of {feature}')
        plt.tight_layout()
        plt.savefig(f'visualisations/distribution_categorical_{feature}.png')
        plt.show()

        
        # Relationship with price
        plt.figure(figsize=(8, 4))
        sns.boxplot(x=feature, y='price', data=df_train)
        plt.title(f'{feature} vs Price')
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.savefig(f'visualisations/price_vs_{feature}.png')
        plt.show()



**Observations**
- Type of vehicle:
    - SUVs, luxury sedans, and sports cars are the most common vehicle types in the dataset.
    - SUVs are highly represented, with relatively lower price variability compared to luxury sedans and sports cars.
    - Truck, station wagon, and bus/mini bus have lower counts, indicating that these vehicle types are less represented in this dataset, possibly niche categories. Generally lower median prices.
    - Luxury sedans and sports cars show the highest resale prices, as expected given the premium nature of these vehicles. There are several outliers in these categories, indicating that some cars are priced significantly higher than the rest.
    - SUVs have a relatively wide price range, with a number of outliers in the upper range, possibly high-end or luxury SUVs. 
    
- Transmission type:
    - The majority of vehicles in the dataset have automatic transmission.
    - Manual transmission vehicles tend to have a lower price range overall compared to automatics. However, there are still some outliers with higher prices, which may correspond to specific models that are rarer or in high demand among enthusiasts (e.g., sports cars with manual transmission).

- Fuel type:
    - The most common fuel type is diesel, followed by petrol-electric and petrol. This suggests a significant preference for diesel vehicles in the dataset.
    - Electric vehicles are less common, while diesel-electric vehicles have the lowest count among the listed fuel types.
    - Petrol-electric vehicles also show a significant range in pricing, with some outliers that could represent premium models. This suggests that hybrid vehicles are valued well in the resale market.
    - Electric vehicles have a narrower price range compared to diesel and petrol-electric, suggesting that the market for used electric vehicles may not be as established yet, potentially affecting their resale value.
    - Note: Many missing values for this feature
- opc_scheme:
    - Since the feature is highly imbalanced, the box plot is not very informative.
- parf:
    - PARF cars show a slightly higher price range than COE cars.
- rare:
    - Rare cars can command much higher resale prices. 
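
The observations above come from reading box plots. A per-category summary table can back them with numbers. The sketch below is an illustration only: the helper name `price_summary_by` is an assumption and the toy prices are made up.

```python
import pandas as pd

def price_summary_by(df, feature, target="price"):
    """Count, median, min and max of the target per category, highest median first.

    Hypothetical helper. dropna=False keeps a row for missing categories,
    which matters for sparse features such as fuel_type.
    """
    return (
        df.groupby(feature, dropna=False)[target]
          .agg(count="count", median="median", min="min", max="max")
          .sort_values("median", ascending=False)
    )

# Toy frame with made-up values
toy = pd.DataFrame({
    "type_of_vehicle": ["suv", "suv", "sports car", "sports car", "truck"],
    "price": [80000, 100000, 200000, 300000, 30000],
})
summary = price_summary_by(toy, "type_of_vehicle")
print(summary)
```

The `count` column also shows at a glance which categories are too small for their box plots to be trusted.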

In [58]:
date_features = ['original_reg_date','reg_date', 'lifespan', 'coe_end']

In [None]:
print("\n=== Date Fields Analysis ===")
for col in date_features:

    print(f"\n{col}:")
    print(f"Unique values sample: {df_train[col].unique()[:5]}")
    # Check for invalid dates
    try:
        df_train[col] = pd.to_datetime(df_train[col])
        print("Min date:", df_train[col].min())
        print("Max date:", df_train[col].max())
    except (ValueError, TypeError):
        print("Error converting to datetime - possible invalid date formats")


In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x=df_train['price'])
plt.title('Boxplot of Price to detect outliers')
plt.show()



In [None]:
df_train[df_train['price'] == df_train['price'].max()]