<img src="images/img.png" />

# CS5228 Project, Group 32

In [135]:
# Auto reload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Data Preprocessing
In this part, we are going to perform some data preprocessing steps. This may include:
* Data cleaning: handle missing values, duplicates, inconsistent or invalid values, outliers

* Data reduction: reduce number of attributes, reduce number of attribute values

* Data transformation: attribute construction, normalization

* Data discretization: encode to numerical attributes

### Setting up the Notebook

In [136]:
import os
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

In [210]:
# Load file into pandas dataframe
df = pd.read_csv('./data/train.csv')

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 25000 data points in training data, each with 30 attributes.


### Data Cleaning

Before data cleaning, remove the known attributes that are not meaningful to our prediction model:
  * Meaningless identifier: listing_id
  * Attributes in free text: title, description, features, accessories
  * Attribute with the same value: eco_category, indicative_price
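  * Attribute unlikely to affect price: curb_weight (its entry in the list below is commented out, so this attribute is currently kept)
  * Other attributes dropped in the cell below: original_reg_date, lifespan, transmission, fuel_type, opc_scheme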
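
As a rough sketch of how the "same value" attributes could be spotted automatically, the snippet below counts distinct values per column on a toy frame. The column names mirror the dataset, but the values are made up for illustration and the check is an assumption about how such columns might be found, not the procedure we actually used.

```python
import pandas as pd

# Toy frame: column names follow the dataset, values are invented.
toy = pd.DataFrame({
    "listing_id": [101, 102, 103],
    "eco_category": ["same", "same", "same"],
    "indicative_price": [None, None, None],
    "price": [50000, 72000, 61000],
})

# A column is treated as constant when it has at most one distinct value.
# dropna=False makes an all-NaN column count as a single value.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```

Columns flagged this way carry no information for a regressor, which is the reason given above for dropping eco_category and indicative_price.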

In [211]:
columns_to_drop = [
    'listing_id',          # Meaningless identifier
    'title',               # Attributes in free text
    'description',
    'features',
    'accessories',
    'eco_category',        # Attribute with the same value
    'indicative_price',
    # 'curb_weight',         # Attribute unlikely to affect price

    'original_reg_date',
    'lifespan',

    # 'make',
    # 'model',
    # 'type_of_vehicle',
    'transmission',
    'fuel_type',
    # 'no_of_owners',
    'opc_scheme',

    # 'category',
]

df = df.drop(columns=columns_to_drop)

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 25000 data points in training data, each with 18 attributes.


### Print Missing Values
First, for each column, count the number of rows with NaN values.
There are 2 scenarios:
1. NaN values are the majority (e.g. fuel_type has 19121 rows with NaN values): we remove the corresponding attribute.
2. NaN values are the minority: we can either fill them in or delete the affected data points.
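
The two scenarios can be sketched as a split on the share of NaN values per column. The 0.5 threshold and the toy values below are assumptions for illustration only; the notebook does not state an explicit cut-off.

```python
import numpy as np
import pandas as pd

# Toy frame: column names follow the dataset, values are invented.
toy = pd.DataFrame({
    "fuel_type": [np.nan, np.nan, np.nan, "petrol"],
    "mileage": [50000.0, np.nan, 82000.0, 12000.0],
    "coe": [30000, 42000, 38000, 51000],
})

THRESHOLD = 0.5  # assumed cut-off between "majority" and "minority"

# isna().mean() gives the fraction of NaN rows per column.
nan_share = toy.isna().mean()

# Scenario 1: mostly NaN, candidate for removal.
mostly_missing = nan_share[nan_share > THRESHOLD].index.tolist()

# Scenario 2: some NaN, candidate for filling or row deletion.
partly_missing = nan_share[(nan_share > 0) & (nan_share <= THRESHOLD)].index.tolist()

print(mostly_missing, partly_missing)
```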

In [212]:
# Calculate the number of NaN values in each specified column
nan_counts = df.isna().sum()

# Print the number of NaN values for each column
print('Training data')
for column, count in nan_counts.items():
    print(f"Column '{column}' has {count} rows with NaN values.")

Training data
Column 'make' has 1316 rows with NaN values.
Column 'model' has 0 rows with NaN values.
Column 'manufactured' has 7 rows with NaN values.
Column 'reg_date' has 0 rows with NaN values.
Column 'type_of_vehicle' has 0 rows with NaN values.
Column 'category' has 0 rows with NaN values.
Column 'curb_weight' has 307 rows with NaN values.
Column 'power' has 2640 rows with NaN values.
Column 'engine_cap' has 596 rows with NaN values.
Column 'no_of_owners' has 18 rows with NaN values.
Column 'depreciation' has 507 rows with NaN values.
Column 'coe' has 0 rows with NaN values.
Column 'road_tax' has 2632 rows with NaN values.
Column 'dereg_value' has 220 rows with NaN values.
Column 'mileage' has 5304 rows with NaN values.
Column 'omv' has 64 rows with NaN values.
Column 'arf' has 174 rows with NaN values.
Column 'price' has 0 rows with NaN values.


### Remove Exact Duplicates
We remove duplicated data points here.

In [213]:
df = df.drop_duplicates()

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 24993 data points in training data, each with 18 attributes.


### Transform date time attributes to numerical values

In [214]:
df['reg_date'] = pd.to_datetime(df['reg_date'], format='%d-%b-%Y')
df['reg_year'] = df['reg_date'].dt.year
df = df.drop(columns=['reg_date'])

num_records, num_attributes = df.shape
print("There are {} data points, each with {} attributes.". format(num_records, num_attributes))

There are 24993 data points, each with 18 attributes.


### Plot correlation matrix

In [142]:
# columns_to_keep = [
#     #'model',
#     #'type_of_vehicle',

#     'mileage',
#     'manufactured',
#     'reg_year',
#     'dereg_value',
#     'depreciation',
#     'power',
#     'coe',
#     # 'arf',
#     'omv',
#     'price',
#     # 'road_tax',
#     'engine_cap',
#     'curb_weight',

#     # 'rare & exotic',
#     # 'hybrid cars',

#     # 'low mileage car',
#     # 'almost new car',
#     # 'parf car',

#     # 'coe car',

# ]

# df = df[columns_to_keep]

# correlation_matrix = df.corr()
# plt.figure(figsize=(12, 8))
# sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", square=True)
# plt.title("Correlation Matrix of Attributes")
# plt.show()

### Fill in the remaining missing values

In [215]:
from util.DataPreprocess import HandlingMissingValues

df = HandlingMissingValues(df)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['manufactured'].fillna(df['reg_year'], inplace=True)


NaN values after handling:  0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['mileage'].fillna((2024 - df['reg_year']) * 17500, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mileage'].fillna((2024 - df['reg_year']) * 17500, inplace=True)
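
The warnings above come from calling `fillna(..., inplace=True)` on a column selected with `df[col]`. Following the suggestion in the warning text itself, a minimal sketch of the same two fills written as assignments looks like this. The toy frame is invented; the two fill expressions are copied from the lines quoted in the warnings.

```python
import numpy as np
import pandas as pd

# Toy frame: column names follow the dataset, values are invented.
toy = pd.DataFrame({
    "reg_year": [2015, 2020, 2018],
    "manufactured": [2014.0, np.nan, 2018.0],
    "mileage": [np.nan, 30000.0, np.nan],
})

# Assign the result back instead of using inplace=True on df[col].
toy["manufactured"] = toy["manufactured"].fillna(toy["reg_year"])

# Missing mileage is estimated as 17500 per year since registration (up to 2024).
toy["mileage"] = toy["mileage"].fillna((2024 - toy["reg_year"]) * 17500)

print(toy)
```

Written this way, the fill does not rely on the chained in-place behavior that the warning says will stop working in pandas 3.0.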


### Transform categorical values to numerical values

In [190]:
# categorical_columns = [
#     'make',
#     'model',
#     'type_of_vehicle',
# ]

# encode_dict = {}
# le = LabelEncoder()
# for column in categorical_columns:
#     df[column] = le.fit_transform(df[column])
#     encode_dict[column] = {str(label): int(index) for index, label in enumerate(le.classes_)}

# with open('./data/encode.json', 'w') as file:
#     json.dump(encode_dict, file, indent=4)

### Handle category attribute

In [216]:
from util.DataPreprocess import HandlingCategoryAttribute

df = HandlingCategoryAttribute(df)

Number of unique categories: 15
Unique categories: {'premium ad car', 'imported used vehicle', 'parf car', 'opc car', 'low mileage car', 'electric cars', 'sgcarmart warranty cars', 'almost new car', 'vintage cars', 'sta evaluated car', 'coe car', 'direct owner sale', 'hybrid cars', 'rare & exotic', 'consignment car'}
There are 24259 data points, each with 32 attributes.


### Remove outliers

In [146]:
# from util.DataPreprocess import OutlierRemoval

# df = OutlierRemoval(df, 'model', 'price')
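
The implementation of `OutlierRemoval` lives in `util.DataPreprocess` and is not shown here. As a rough illustration of what a per-group outlier filter could look like, the sketch below uses the common 1.5 x IQR rule within each group. The function name `remove_outliers_iqr`, the rule itself, and the toy data are all assumptions, not the project's actual method.

```python
import pandas as pd

def remove_outliers_iqr(df, group_col, value_col, k=1.5):
    """Hypothetical helper: keep rows whose value lies within
    [Q1 - k*IQR, Q3 + k*IQR] of their own group."""
    grouped = df.groupby(group_col)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = df[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Toy frame: one obviously extreme price in group "a".
toy = pd.DataFrame({
    "model": ["a", "a", "a", "a", "a", "b", "b"],
    "price": [10, 11, 12, 13, 100, 50, 52],
})

cleaned = remove_outliers_iqr(toy, "model", "price")
print(cleaned)
```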

### Saving the Data

In [219]:
file_name = 'data/xi/train_preprocessed.csv'

# Check if the file exists
if os.path.exists(file_name):
    # Delete the file
    os.remove(file_name)
    print(f"Existing file '{file_name}' has been deleted.")

# Save the DataFrame to CSV
df.to_csv(file_name, index=False)
print(f"DataFrame has been saved to '{file_name}'.")

DataFrame has been saved to 'data/xi/train_preprocessed.csv'.


## Data Mining

### 1) Load preprocessed training data

In [224]:
# Load the preprocessed training file saved in the previous section into a pandas dataframe
training_file = 'data/xi/train_preprocessed.csv'
df = pd.read_csv(training_file)

columns_to_keep = [
    #'model',
    #'type_of_vehicle',

    'mileage',
    'manufactured',
    'reg_year',
    'dereg_value',
    'depreciation',
    'power',
    'coe',
    # 'arf',
    'omv',
    'price',
    # 'road_tax',
    'engine_cap',
    'curb_weight',

    'rare & exotic',
    # 'hybrid cars',

    # 'low mileage car',
    # 'almost new car',
    # 'parf car',

    # 'coe car',

]

df_new = df[columns_to_keep]

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 24259 data points in training data, each with 32 attributes.


### Data Augmentation: copy rows of groups with fewer than 5 samples

In [160]:
# from util.DataPreprocess import DataAugmentation

# df_aug = DataAugmentation(df)

# num_records, num_attributes = df_aug.shape
# print("There are {} data points after augmentation, each with {} attributes.". format(num_records, num_attributes))
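
`DataAugmentation` is defined in `util.DataPreprocess` and its code is not shown here. Going only by the heading, one possible reading is that rows belonging to groups with fewer than 5 samples are duplicated. The sketch below implements that reading; the helper name `augment_small_groups`, the grouping column, and the choice to copy each row exactly once are assumptions.

```python
import pandas as pd

def augment_small_groups(df, group_col, min_count=5):
    """Hypothetical helper: append one copy of every row whose group
    has fewer than min_count rows."""
    # Size of the group each row belongs to.
    group_sizes = df[group_col].map(df[group_col].value_counts())
    small = df[group_sizes < min_count]
    return pd.concat([df, small], ignore_index=True)

# Toy frame: group "a" has 5 rows, group "b" only 2.
toy = pd.DataFrame({
    "model": ["a"] * 5 + ["b"] * 2,
    "price": [10, 11, 12, 13, 14, 50, 52],
})

augmented = augment_small_groups(toy, "model")
print(augmented["model"].value_counts())
```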

### 2) Split input attributes and output attribute

In [225]:
y = df_new['price']
X = df_new.drop(columns=['price'])

### 3) Hyperparameter tuning

In [222]:
import numpy as np

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_validate

# Only considered hyperparameter: max depth of trees
param_choices = [1, 2, 3, 4, 5, 6]
# param_choices = [8, 10, 12, 15, 20]

# Keep track of results for visualization
param_to_scores = {}

for param in param_choices:

    # Train regressor with the current parameter setting
    # regressor = DecisionTreeRegressor(max_depth=param)
    # regressor = RandomForestRegressor(max_depth=param)
    regressor = GradientBoostingRegressor(max_depth=param)
    
    # Perform 10-fold cross-validation
    scores = cross_validate(regressor, X, y, cv=10, scoring='neg_root_mean_squared_error', return_train_score=True)
    
    # Extract the 10 RMSE scores (training scores and validation scores) for each run/fold
    # The (-1) is only needed since we get the negative root mean squared errors (it's a sklearn thing)
    rmse_train = scores['train_score'] * (-1)
    rmse_valid = scores['test_score'] * (-1)
    
    # Keep track of the per-fold RMSE scores for the current param (for plotting)
    param_to_scores[param] = (rmse_train, rmse_valid)
    
    # Print statement for some immediate feedback (values in parentheses represent the standard deviation)
    print('param = {}, RMSE training = {:.1f} ({:.1f}), RMSE validation = {:.1f} ({:.1f})'
          .format(param, np.mean(rmse_train), np.std(rmse_train), np.mean(rmse_valid), np.std(rmse_valid)))

param = 1, RMSE training = 36069.1 (587.2), RMSE validation = 39168.8 (5227.9)
param = 2, RMSE training = 23935.2 (249.8), RMSE validation = 29194.0 (3873.9)
param = 3, RMSE training = 15602.9 (319.7), RMSE validation = 22345.7 (4131.3)
param = 4, RMSE training = 10612.6 (182.9), RMSE validation = 19879.8 (5390.0)
param = 5, RMSE training = 7464.5 (74.1), RMSE validation = 18988.9 (5892.2)
param = 6, RMSE training = 5473.4 (84.3), RMSE validation = 19476.8 (6693.9)
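
In the results above, the mean validation error is lowest at param = 5, although the standard deviations across folds are large relative to the gaps between depths. A minimal sketch of reading the best setting out of a dictionary shaped like `param_to_scores` (param mapped to a tuple of training and validation arrays) is shown below. The helper name `pick_best_param` and the toy numbers are assumptions for illustration, not the per-fold scores of the run above.

```python
import numpy as np

def pick_best_param(param_to_scores):
    """Hypothetical helper: return the param with the lowest mean validation error.
    Expects {param: (train_scores, valid_scores)}."""
    return min(param_to_scores, key=lambda p: np.mean(param_to_scores[p][1]))

# Toy scores, two folds per setting (invented values).
toy_scores = {
    3: (np.array([15.0, 16.0]), np.array([22.0, 23.0])),
    5: (np.array([7.0, 8.0]), np.array([18.0, 20.0])),
    6: (np.array([5.0, 6.0]), np.array([19.0, 20.0])),
}

best = pick_best_param(toy_scores)
print(best)
```

Selecting on the validation scores rather than the training scores matters here, since the training error keeps falling with depth while the validation error does not.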


## Prediction

### 1) Load and preprocess test dataset

In [226]:
from util.DataPreprocess import HandlingMissingValuesTest

test_file = './data/test.csv'
df_test = pd.read_csv(test_file)

df_test['reg_date'] = pd.to_datetime(df_test['reg_date'], format='%d-%b-%Y')
df_test['reg_year'] = df_test['reg_date'].dt.year
df_test = df_test.drop(columns=['reg_date'])

df_test = HandlingMissingValuesTest(df, df_test)
df_test = HandlingCategoryAttribute(df_test)

file_name = 'data/xi/test_preprocessed.csv'

# Check if the file exists
if os.path.exists(file_name):
    # Delete the file
    os.remove(file_name)
    print(f"Existing file '{file_name}' has been deleted.")

# Save the DataFrame to CSV
df_test.to_csv(file_name, index=False)
print(f"DataFrame has been saved to '{file_name}'.")

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_test['manufactured'].fillna(df_test['reg_year'], inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_test['mileage'].fillna((2024 - df_test['reg_year']) * 17500, inplace=True)


NaN values after handling:  50669
Number of unique categories: 15
Unique categories: {'premium ad car', 'imported used vehicle', 'parf car', 'opc car', 'electric cars', 'sta evaluated car', 'sgcarmart warranty cars', 'almost new car', 'vintage cars', 'consignment car', 'coe car', 'direct owner sale', 'hybrid cars', 'rare & exotic', 'low mileage car'}
There are 10000 data points, each with 43 attributes.
DataFrame has been saved to 'data/xi/test_preprocessed.csv'.


### 2) Predict using GBR with the hyperparameter selected in the previous tuning step

In [228]:
# Load the preprocessed test file saved in the previous step into a pandas dataframe
test_file = 'data/xi/test_preprocessed.csv'
df_test = pd.read_csv(test_file)

columns_to_keep = [col for col in df_new.columns if col != 'price']

df_test = df_test[columns_to_keep]
df_test = df_test.fillna(X.mean())
# Calculate the number of NaN values in each specified column
nan_counts = df_test.isna().sum()

# Print the number of NaN values for each column
print('Test data')
for column, count in nan_counts.items():
    print(f"Column '{column}' has {count} rows with NaN values.")

gbr_model = GradientBoostingRegressor(max_depth=4)
gbr_model.fit(X, y)
y_pred = gbr_model.predict(df_test)

predictions_df = pd.DataFrame({
    'Id': df_test.index,
    'Predicted': y_pred
})

# Save to a CSV file
predictions_df.to_csv('data/xi/predictions.csv', index=False)
print("Prediction file 'predictions.csv' generated successfully.")

Test data
Column 'mileage' has 0 rows with NaN values.
Column 'manufactured' has 0 rows with NaN values.
Column 'reg_year' has 0 rows with NaN values.
Column 'dereg_value' has 0 rows with NaN values.
Column 'depreciation' has 0 rows with NaN values.
Column 'power' has 0 rows with NaN values.
Column 'coe' has 0 rows with NaN values.
Column 'omv' has 0 rows with NaN values.
Column 'engine_cap' has 0 rows with NaN values.
Column 'curb_weight' has 0 rows with NaN values.
Column 'rare & exotic' has 0 rows with NaN values.
Prediction file 'predictions.csv' generated successfully.
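
The call `df_test.fillna(X.mean())` in the cell above fills each test column with the mean of the training column of the same name, because pandas aligns the Series returned by `X.mean()` with the DataFrame's column labels. A small sketch with invented values shows the effect.

```python
import numpy as np
import pandas as pd

# Toy training and test frames (invented values).
X_toy = pd.DataFrame({"power": [100.0, 140.0], "omv": [20000.0, 30000.0]})
test_toy = pd.DataFrame({"power": [np.nan, 90.0], "omv": [25000.0, np.nan]})

# One mean per training column, indexed by column name.
train_means = X_toy.mean()

# fillna with a Series fills each column using the entry with the matching label.
test_filled = test_toy.fillna(train_means)
print(test_filled)
```

Using the training means (rather than means computed on the test set) keeps the imputation consistent with the data the model was fitted on.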


### Load test data and preprocess

In [15]:
test_file = './data/test.csv'
df_test = pd.read_csv(test_file)

df_test['reg_date'] = pd.to_datetime(df_test['reg_date'], format='%d-%b-%Y')
df_test['reg_year'] = df_test['reg_date'].dt.year
df_test = df_test.drop(columns=['reg_date'])

# Replace '-' with an empty string
df_test['category'] = df_test['category'].replace('-', '')

# Split the 'category' column into lists
df_test['category_list'] = df_test['category'].str.split(', ')

# Handle empty strings by replacing them with empty lists
df_test['category_list'] = df_test['category_list'].apply(lambda x: [] if x == [''] else x)

# Import itertools for flattening lists
from itertools import chain

# Flatten the list of lists to a single list
all_categories = list(chain.from_iterable(df_test['category_list']))

# Get the unique categories
unique_categories = set(all_categories)

# Print the number of unique categories
print(f"Number of unique categories: {len(unique_categories)}")
print("Unique categories:", unique_categories)

# Initialize the MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit and transform the category lists
category_dummies = mlb.fit_transform(df_test['category_list'])

# Create a DataFrame with the one-hot encoded categories
category_df = pd.DataFrame(category_dummies, columns=mlb.classes_, index=df_test.index)

# Concatenate the new dummy columns to the original DataFrame
df_test = pd.concat([df_test, category_df], axis=1)

# Drop the temporary 'category_list' column if desired
df_test.drop('category_list', axis=1, inplace=True)
df_test.drop('category', axis=1, inplace=True)

num_records, num_attributes = df.shape

print("There are {} data points, each with {} attributes.". format(num_records, num_attributes))

Number of unique categories: 15
Unique categories: {'sgcarmart warranty cars', 'hybrid cars', 'rare & exotic', 'almost new car', 'direct owner sale', 'imported used vehicle', 'consignment car', 'vintage cars', 'electric cars', 'sta evaluated car', 'parf car', 'opc car', 'coe car', 'low mileage car', 'premium ad car'}
There are 24258 data points, each with 17 attributes.
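
The cell above fits `MultiLabelBinarizer` on the test data alone. That yields the same 15 categories as the training data in this case, but it is not guaranteed in general. A sketch of one way to keep the dummy columns aligned, assuming the category lists have already been built for both sets, is to fit on the training lists and only transform the test lists. The toy lists below are invented.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy category lists (invented rows, category names taken from the dataset).
train_lists = pd.Series([["parf car"], ["coe car", "low mileage car"], []])
test_lists = pd.Series([["coe car"], ["parf car", "low mileage car"]])

# Learn the category vocabulary from the training data only.
mlb = MultiLabelBinarizer()
mlb.fit(train_lists)

# Reuse the fitted binarizer so both frames share the same columns in the same order.
train_dummies = pd.DataFrame(mlb.transform(train_lists), columns=mlb.classes_, index=train_lists.index)
test_dummies = pd.DataFrame(mlb.transform(test_lists), columns=mlb.classes_, index=test_lists.index)

print(test_dummies)
```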


### Select attributes on test data

In [17]:
num_records, num_attributes = df_test.shape
print("There are {} data points, each with {} attributes.". format(num_records, num_attributes))

categorical_columns = [
    'make',
    'model',
    'type_of_vehicle',
    'transmission',
]

with open('./data/encode.json', 'r') as file:
    data = json.load(file)

for col, cate_dict in data.items():
    if col in df_test.columns:
        df_test[col] = df_test[col].map(cate_dict)

df_test = df_test[columns_to_keep]

num_records, num_attributes = df_test.shape
print("There are {} data points in test data, each with {} attributes.". format(num_records, num_attributes))

There are 10000 data points, each with 43 attributes.
There are 10000 data points in test data, each with 16 attributes.


### Check if train data has all models in test data

In [19]:
models_in_df = set(df['model'].unique())
models_in_df_test = set(df_test['model'].unique())

if models_in_df_test.issubset(models_in_df):
    print("df includes all models in df_test")
else:
    missing_models = models_in_df_test - models_in_df
    print("df does not include", missing_models)

df does not include {np.float64(nan)}
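
The only element reported above is NaN. After the `.map(cate_dict)` step, NaN marks test values that had no entry in the mapping (or were missing to begin with), so it is not a model name in its own right. A sketch of the same check with NaN handled separately, using invented model names:

```python
import numpy as np
import pandas as pd

# Toy model columns (invented values).
train_models = pd.Series(["civic", "vios", "x5"])
test_models = pd.Series(["civic", np.nan, "x5"])

# Compare only real labels; NaN is dropped before building the sets.
known = set(train_models.dropna().unique())
seen_in_test = set(test_models.dropna().unique())
missing_models = seen_in_test - known

# Count the unmapped rows separately.
n_unmapped = int(test_models.isna().sum())

print(missing_models, n_unmapped)
```

Counting the NaN rows on their own gives the number of test rows that need a fallback that does not depend on 'model'.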


### Mining code here

In [26]:
from util.DataMining import split_dataframe, split_dataframe_flex
from util.DataMining import (
    RandomForestMining,
    RandomForestMiningByModel,
    GradientBoostingMining,
    LinearRegressionMining,
    LinearRegressionMiningByModel,
    CombinedDataMiningRandomForestAndLinearRegression
)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [21]:
# run_times, rmse_sum = 5, 0
# for i in tqdm(range(run_times), desc='Running Random Forest'):
#     target_col = 'price'
#     x_train, x_test, y_train, y_test = split_dataframe(df, target_col)
#     rmse_sum += RandomForestMining(x_train, x_test, y_train, y_test)
# print('Average RMSE:', round(rmse_sum / run_times))

In [27]:
run_times, rmse_sum = 1, 0
for i in tqdm(range(run_times), desc='Running Random Forest'):
    train_drop_cols = ['price']
    test_cols = ['price', 'model']
    x_train, x_test, y_train, y_test = split_dataframe_flex(df_aug, train_drop_cols, test_cols)
    rmse_sum += RandomForestMiningByModel(x_train, x_test, y_train, y_test)
print('Average RMSE:', round(rmse_sum / run_times))

  y_pred = pd.concat([y_pred, temp_df])
Running Random Forest: 100%|██████████| 1/1 [03:29<00:00, 209.39s/it]

Data saved to results.csv
Running not in develop mode
RMSE on test data: 15525.294495071965
Average RMSE: 15525





In [23]:
# x_train, y_train = df.drop(columns=['price']), df['price']
# x_test = df_test[x_train.columns]

# res = RandomForestMining(x_train, x_test, y_train, dev=True)
# res.to_csv('./data/res.csv', index=False)

### Predict model by model

In [27]:
x_train, y_train = df.drop(columns=['price']), df[['price', 'model']]
x_test = df_test[x_train.columns].dropna(subset=['model'])

res_model = RandomForestMiningByModel(x_train, x_test, y_train, dev=True)
print(res_model.head())

  y_pred = pd.concat([y_pred, temp_df])


0     19560.955
1     33476.140
2    143657.020
3     73018.820
4     27141.215
Name: Predicted, dtype: float64


### Predict on test data where the 'model' attribute is missing

In [28]:
x_train, y_train = df.drop(columns=['price', 'model']), df[['price']]
x_test = df_test_unmapped[x_train.columns]

res_nomodel = RandomForestMining(x_train, x_test, y_train, dev=True)
print(res_nomodel.head())

  return fit_method(estimator, *args, **kwargs)


         Predicted
21    56183.438972
195  285485.266971
212  166325.453607
402   19893.851406
412   73037.641540


In [29]:
print(len(res_model))
print(len(res_nomodel))
res = pd.concat([res_model, res_nomodel])
res.to_csv('./data/res_by_model_original.csv')
res.reset_index(inplace=True)
res.rename(columns={'index': 'Id'}, inplace=True)
res_sorted = res.sort_values(by='Id')
res_sorted.to_csv('./data/res_by_model2.csv', index=False)

9902
98
