<img src="images/img.png" />

# CS5228 Project, Group 32

In [1]:
# Auto reload
%load_ext autoreload
%autoreload 2

## Data Preprocessing
In this part, we are going to perform some data preprocessing steps. This may include:
* Data cleaning: handle missing values, duplicates, inconsistent or invalid values, outliers

* Data reduction: reduce number of attributes, reduce number of attribute values

* Data transformation: attribute construction, normalization

* Data discretization: encode to numerical attributes

### Setting up the Notebook

In [2]:
import os
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# Load file into pandas dataframe
df = pd.read_csv('./data/train.csv')

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 25000 data points in training data, each with 30 attributes.


### Data Cleaning

Before data cleaning, remove the attributes that are known not to be meaningful to our prediction model:
  * Meaningless identifier: listing_id
  * Attributes in free text: title, description, features, accessories
  * Attributes with a single constant value: eco_category, indicative_price
  * Attribute unlikely to affect price: curb_weight

The cell below also drops original_reg_date and lifespan.

In [4]:
columns_to_drop = [
    'listing_id',          # Meaningless identifier
    'title',               # Attributes in free text
    'description',
    'features',
    'accessories',
    'eco_category',        # Attribute with the same value
    'indicative_price',
    'curb_weight',         # Attribute unlikely to affect price
    'original_reg_date',
    'lifespan',
]

df = df.drop(columns=columns_to_drop)

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 25000 data points in training data, each with 20 attributes.


### Handle Missing Values
First, for each column that may contain missing values, count the rows with NaN values.
There are 2 scenarios:
1. NaN values are the majority (e.g. fuel_type has 19121 rows with NaN values): we remove the corresponding attributes.
2. NaN values are the minority: we can either fill the missing values or delete the affected data points.
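
The two scenarios can be told apart programmatically by comparing each column's NaN fraction with a cutoff. The sketch below is only an illustration: the toy frame and the 0.5 cutoff (`NAN_FRACTION_CUTOFF`) are assumptions, not values taken from our pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "fuel_type": [np.nan, np.nan, np.nan, "petrol"],
    "power": [96.0, np.nan, 141.0, 79.0],
    "coe": [57199, 42564, 32801, 29159],
})

# Assumed cutoff: a column counts as "mostly NaN" above this fraction.
NAN_FRACTION_CUTOFF = 0.5

# Fraction of NaN values per column.
nan_fraction = toy.isna().mean()

# Scenario 1: NaN values are the majority, so the column is dropped.
cols_to_drop = nan_fraction[nan_fraction > NAN_FRACTION_CUTOFF].index.tolist()

# Scenario 2: NaN values are the minority, so the column is filled
# (or the affected rows are deleted).
minority = (nan_fraction > 0) & (nan_fraction <= NAN_FRACTION_CUTOFF)
cols_to_fill = nan_fraction[minority].index.tolist()

print("drop:", cols_to_drop)
print("fill or delete rows:", cols_to_fill)
```

With the counts reported in the next cell's output, only fuel_type (19121 of 25000 rows) and opc_scheme (24838 of 25000 rows) would exceed such a cutoff.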

In [5]:
columns_to_check = [
    'make',
    'fuel_type',
    'manufactured',
    'power',
    'engine_cap',
    'mileage',
    'no_of_owners',
    'depreciation',
    'road_tax',
    'dereg_value',
    'omv',
    'arf',
    'opc_scheme'
]

# Calculate the number of NaN values in each specified column
nan_counts = df[columns_to_check].isna().sum()

# Print the number of NaN values for each column
print('Training data')
for column, count in nan_counts.items():
    print(f"Column '{column}' has {count} rows with NaN values.")

Training data
Column 'make' has 1316 rows with NaN values.
Column 'fuel_type' has 19121 rows with NaN values.
Column 'manufactured' has 7 rows with NaN values.
Column 'power' has 2640 rows with NaN values.
Column 'engine_cap' has 596 rows with NaN values.
Column 'mileage' has 5304 rows with NaN values.
Column 'no_of_owners' has 18 rows with NaN values.
Column 'depreciation' has 507 rows with NaN values.
Column 'road_tax' has 2632 rows with NaN values.
Column 'dereg_value' has 220 rows with NaN values.
Column 'omv' has 64 rows with NaN values.
Column 'arf' has 174 rows with NaN values.
Column 'opc_scheme' has 24838 rows with NaN values.


We delete the attributes with too many NaN values here.

In [6]:
columns_to_drop_nan = [
    'fuel_type',
    'opc_scheme'
]

df = df.drop(columns=columns_to_drop_nan)

Then we fill in the remaining missing values.

In [7]:
from util.DataPreprocess import HandlingMissingValues

df = HandlingMissingValues(df)

NaN values after handling:  0
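
The body of `HandlingMissingValues` lives in `util/DataPreprocess.py` and is not shown in this notebook. As a rough idea of what such a helper can look like, the sketch below fills numeric columns with the median and other columns with the most frequent value. The function name `fill_missing_sketch` and this filling strategy are assumptions and may differ from our actual helper.

```python
import numpy as np
import pandas as pd

def fill_missing_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in, not the project's HandlingMissingValues.

    Assumes every column has at least one non-missing value.
    """
    filled = frame.copy()
    for column in filled.columns:
        if not filled[column].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[column]):
            # Numeric column: use the median, which is robust to extreme values.
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            # Non-numeric column: use the most frequent value.
            filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    return filled

# Toy data (made up) with one missing value per column.
toy = pd.DataFrame({
    "mileage": [112000.0, np.nan, 43000.0],
    "make": ["toyota", None, "toyota"],
})
filled = fill_missing_sketch(toy)
print("NaN values after handling: ", int(filled.isna().sum().sum()))
```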


### Remove Exact Duplicates
We remove duplicated data points here.

In [8]:
df = df.drop_duplicates()

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 24258 data points in training data, each with 18 attributes.


### Merge rarely occurring values of specific attributes

In [9]:
threshold = 2

value_counts = df['make'].value_counts()
categories_to_replace = value_counts[value_counts < threshold].index
df['make'] = df['make'].replace(categories_to_replace, 'others')
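
To make the effect of the threshold concrete, here is the same idea on a made-up Series, written with `isin` and `where` instead of `replace`. Note that with `threshold = 2` and the `<` comparison, only values that occur exactly once are merged into 'others'.

```python
import pandas as pd

# Made-up makes: 'lotus' occurs once, the others occur twice.
makes = pd.Series(["toyota", "toyota", "honda", "honda", "lotus"], name="make")

threshold = 2
counts = makes.value_counts()

# Values that occur fewer than `threshold` times.
rare = counts[counts < threshold].index

# Keep a value where it is not rare, otherwise substitute 'others'.
merged = makes.where(~makes.isin(rare), "others")
print(merged.tolist())
```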

### Transform categorical values to numerical values

In [10]:
categorical_columns = [
    'make',
    'model',
    'type_of_vehicle',
    'transmission',
]

encode_dict = {}
le = LabelEncoder()
for column in categorical_columns:
    df[column] = le.fit_transform(df[column])
    encode_dict[column] = {str(label): int(index) for index, label in enumerate(le.classes_)}

with open('./data/encode.json', 'w') as file:
    json.dump(encode_dict, file, indent=4)

### Transform date-time attributes to numerical values

In [11]:
df['reg_date'] = pd.to_datetime(df['reg_date'], format='%d-%b-%Y')
df['reg_year'] = df['reg_date'].dt.year
df = df.drop(columns=['reg_date'])

num_records, num_attributes = df.shape
print("There are {} data points, each with {} attributes.". format(num_records, num_attributes))

There are 24258 data points, each with 18 attributes.
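
As a small illustration of the `%d-%b-%Y` format used above, the sketch below parses two made-up strings. The `errors="coerce"` argument is an optional addition that our cell does not use: it turns unparseable values into `NaT` instead of raising. Month abbreviations are assumed to be in English.

```python
import pandas as pd

# Made-up registration dates; the second one is deliberately malformed.
reg_dates = pd.Series(["15-Jan-2015", "not a date"])

# %d = day of month, %b = abbreviated month name, %Y = four-digit year.
parsed = pd.to_datetime(reg_dates, format="%d-%b-%Y", errors="coerce")

# Extract the year; rows that failed to parse yield NaN.
reg_year = parsed.dt.year
print(reg_year)
```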


### Handle category attribute

In [12]:
from util.DataPreprocess import HandlingCategoryAttribute

df = HandlingCategoryAttribute(df)

Number of unique categories: 15
Unique categories: {'electric cars', 'opc car', 'rare & exotic', 'consignment car', 'almost new car', 'imported used vehicle', 'parf car', 'premium ad car', 'vintage cars', 'coe car', 'direct owner sale', 'hybrid cars', 'low mileage car', 'sgcarmart warranty cars', 'sta evaluated car'}
There are 24258 data points, each with 32 attributes.


### Remove outliers

In [13]:
# from util.DataPreprocess import OutlierRemoval

# df = OutlierRemoval(df, 'model', 'price')
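
Outlier removal is disabled in this run, and the body of `OutlierRemoval` is not shown here. For reference, one common way to remove outliers of one column within groups of another is the interquartile-range (IQR) rule. The function `remove_outliers_by_group` and the factor `k=1.5` below are assumptions for illustration and are not necessarily what our helper does.

```python
import pandas as pd

def remove_outliers_by_group(frame, group_col, value_col, k=1.5):
    """Hypothetical sketch: keep rows whose value lies inside
    [Q1 - k*IQR, Q3 + k*IQR], computed within the row's own group."""
    grouped = frame.groupby(group_col)[value_col]
    # Broadcast each group's quartiles back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = frame[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]

# Made-up prices: model 0 contains one extreme value (90000).
toy = pd.DataFrame({
    "model": [0, 0, 0, 0, 0, 1, 1],
    "price": [10000, 11000, 10500, 10800, 90000, 50000, 52000],
})
cleaned = remove_outliers_by_group(toy, "model", "price")
print(len(toy), "->", len(cleaned), "rows")
```

Rows whose group key is NaN get NaN bounds and are therefore dropped by this sketch.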

### Saving the Data

In [14]:
file_name = './data/train_preprocessed.csv'

# Check if the file exists
if os.path.exists(file_name):
    # Delete the file
    os.remove(file_name)
    print(f"Existing file '{file_name}' has been deleted.")

# Save the DataFrame to CSV
df.to_csv(file_name, index=False)
print(f"DataFrame has been saved to '{file_name}'.")

Existing file './data/train_preprocessed.csv' has been deleted.
DataFrame has been saved to './data/train_preprocessed.csv'.


## Data Mining

### Load preprocessed training data

In [52]:
# Load the preprocessed file saved above into a pandas dataframe
training_file = './data/train_preprocessed.csv'
df = pd.read_csv(training_file)

columns_to_keep = [
    'model',
    'mileage',
    'low mileage car',
    'manufactured',
    'reg_year',
    'type_of_vehicle',
    'dereg_value',
    'depreciation',
    'power',
    'coe',
    'arf',
    'omv',
    'price',
    'road_tax',
    'almost new car',
    'coe car',
    'parf car',
]

df = df[columns_to_keep]
columns_to_keep = [col for col in df.columns if col != 'price']

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 24258 data points in training data, each with 17 attributes.


### Load test data and preprocess

In [67]:
test_file = './data/test.csv'
df_test = pd.read_csv(test_file)

df_test['reg_date'] = pd.to_datetime(df_test['reg_date'], format='%d-%b-%Y')
df_test['reg_year'] = df_test['reg_date'].dt.year
df_test = df_test.drop(columns=['reg_date'])

# Replace '-' with an empty string
df_test['category'] = df_test['category'].replace('-', '')

# Split the 'category' column into lists
df_test['category_list'] = df_test['category'].str.split(', ')

# Handle empty strings by replacing them with empty lists
df_test['category_list'] = df_test['category_list'].apply(lambda x: [] if x == [''] else x)

# Import itertools for flattening lists
from itertools import chain

# Flatten the list of lists to a single list
all_categories = list(chain.from_iterable(df_test['category_list']))

# Get the unique categories
unique_categories = set(all_categories)

# Print the number of unique categories
print(f"Number of unique categories: {len(unique_categories)}")
print("Unique categories:", unique_categories)

# Initialize the MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit and transform the category lists
category_dummies = mlb.fit_transform(df_test['category_list'])

# Create a DataFrame with the one-hot encoded categories
category_df = pd.DataFrame(category_dummies, columns=mlb.classes_, index=df_test.index)

# Concatenate the new dummy columns to the original DataFrame
df_test = pd.concat([df_test, category_df], axis=1)

# Drop the temporary 'category_list' column if desired
df_test.drop('category_list', axis=1, inplace=True)
df_test.drop('category', axis=1, inplace=True)

num_records, num_attributes = df_test.shape

print("There are {} data points, each with {} attributes.".format(num_records, num_attributes))

Number of unique categories: 15
Unique categories: {'electric cars', 'opc car', 'rare & exotic', 'consignment car', 'almost new car', 'imported used vehicle', 'parf car', 'premium ad car', 'vintage cars', 'coe car', 'direct owner sale', 'hybrid cars', 'low mileage car', 'sgcarmart warranty cars', 'sta evaluated car'}
There are 10000 data points, each with 43 attributes.


### Data Augmentation: copy rows of groups with fewer than 5 samples

In [68]:
from util.DataPreprocess import DataAugmentation

df_aug = DataAugmentation(df)

num_records, num_attributes = df_aug.shape
print("There are {} data points after augmentation, each with {} attributes.". format(num_records, num_attributes))

There are 40290 data points after augmentation, each with 17 attributes.
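
`DataAugmentation` is defined in `util/DataPreprocess.py` and its body is not shown in this notebook. The sketch below shows one way to copy rows of small groups so that each group reaches a minimum size. The function `augment_small_groups`, the grouping column `model`, and the cycling strategy are assumptions for illustration.

```python
import pandas as pd

def augment_small_groups(frame, group_col="model", min_samples=5):
    """Hypothetical sketch: repeat the rows of every group that has fewer
    than `min_samples` rows until it has exactly `min_samples` rows."""
    pieces = [frame]
    for _, group in frame.groupby(group_col):
        n = len(group)
        if n < min_samples:
            missing = min_samples - n
            # Cycle through the group's own rows to produce the extra copies.
            extra = group.iloc[[i % n for i in range(missing)]]
            pieces.append(extra)
    return pd.concat(pieces, ignore_index=True)

# Made-up data: model 0 has 2 rows, model 1 already has 5 rows.
toy = pd.DataFrame({
    "model": [0, 0, 1, 1, 1, 1, 1],
    "price": [10000, 11000, 50000, 51000, 52000, 53000, 54000],
})
augmented = augment_small_groups(toy)
print(augmented["model"].value_counts().to_dict())
```

If copies are made before a train/validation split, identical rows can land on both sides of the split, which can make the validation RMSE look better than it is.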


### Select attributes on test data

In [69]:
num_records, num_attributes = df_test.shape
print("There are {} data points, each with {} attributes.". format(num_records, num_attributes))

categorical_columns = [
    'make',
    'model',
    'type_of_vehicle',
    'transmission',
]

with open('./data/encode.json', 'r') as file:
    data = json.load(file)

# Print every test-set model that has no entry in the training-time encoding
model_codes = data.get('model', {})
if 'model' in df_test.columns:
    for model_name in df_test.loc[~df_test['model'].isin(list(model_codes)), 'model']:
        print(model_name)

df_test = df_test[columns_to_keep]

num_records, num_attributes = df_test.shape
print("There are {} data points in test data, each with {} attributes.". format(num_records, num_attributes))

There are 10000 data points, each with 43 attributes.
fd7jjma
kluger
phaeton
hkl6540
tourer
lt434p
sh1eema
a160
sh1eema
cx-7
s350d
colorado
slyphy
vanguard
princess
bb
ya
fd7jpma
e350
dolphin
e-type
fvr90
ev
cwb45a
260e
clk230
230sl
eletre
clk280
genesis
e350
clk230
p5b
artura
trajet
c350
sierra
tourer
patrol
350sl
88
biturbo
flh290
meriva
artura
tarraco
cyz52r
mgb
seven
fs1elkd
fvr90
924
midget
k94ib4x2
daewoo
good
2002
clk280
sl500
xml6772
captiva
3336k
td
b7r
arnage
a170
3000gt
midget
1750
tong
gh8jrka
brooklands
9-5
alpine
924
348
vanguard
88
i40
2600
exige
mgb
sl280
sh1eema
356b
xml6772
safari
e280
ds
e-golf
sprinter
sl280
midget
300gd
midget
dolly
62
4
There are 10000 data points in test data, each with 16 attributes.
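
The cell above lists 98 test rows whose model has no entry in `encode.json`. The sketch below shows one way to apply a saved label-to-code mapping to test data while giving unseen labels an explicit sentinel code instead of NaN. The mapping contents, the toy frame, and the sentinel `UNSEEN_CODE = -1` are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the contents of ./data/encode.json (labels are made up).
encode_dict = {
    "model": {"civic": 0, "vios": 1},
    "transmission": {"auto": 0, "manual": 1},
}

test = pd.DataFrame({
    "model": ["vios", "kluger", "civic"],
    "transmission": ["auto", "auto", "manual"],
})

UNSEEN_CODE = -1  # assumed sentinel for labels not seen during training

unseen_report = {}
for column, mapping in encode_dict.items():
    if column not in test.columns:
        continue
    # Labels present in the test data but absent from the training mapping.
    unseen = test.loc[~test[column].isin(list(mapping)), column].tolist()
    if unseen:
        unseen_report[column] = unseen
    # A dictionary lookup leaves unseen labels as NaN; replace them explicitly.
    test[column] = test[column].map(mapping).fillna(UNSEEN_CODE).astype(int)

print(unseen_report)
print(test)
```

Whether a sentinel code, a dedicated 'others' class, or dropping the rows is the better choice depends on the model that consumes the encoded column.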


In [56]:
nan_counts = df_test.isna().sum()

# Print the number of NaN values for each column
print('Test data')
print(nan_counts)

Test data
model                98
mileage            2166
low mileage car       0
manufactured          3
reg_year              0
type_of_vehicle       0
dereg_value          83
depreciation        201
power              1086
coe                   0
arf                  65
omv                  29
road_tax           1082
almost new car        0
coe car               0
parf car              0
dtype: int64


### Check if train data has all models in test data

In [47]:
models_in_df = set(df['model'].unique())
models_in_df_test = set(df_test['model'].unique())

if models_in_df_test.issubset(models_in_df):
    print("df includes all models in df_test")
else:
    missing_models = models_in_df_test - models_in_df
    print("df does not include", missing_models)

df does not include {np.float64(nan)}
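
The only "missing model" reported here is NaN, which comes from test rows whose model is missing rather than from a real model code. A variant of the check that drops NaN before comparing and counts those rows separately could look like the sketch below (toy values, made up).

```python
import numpy as np
import pandas as pd

# Made-up encoded model columns; NaN marks a test row without a model code.
train_models = pd.Series([705.0, 36.0, 221.0])
test_models = pd.Series([705.0, np.nan, 36.0, np.nan])

# Comparing the raw unique values reports NaN as a "missing model".
missing_with_nan = set(test_models.unique()) - set(train_models.unique())

# Dropping NaN first leaves only real model codes in the comparison.
missing_without_nan = set(test_models.dropna().unique()) - set(train_models.unique())

# Count the rows without a model code separately.
n_unmapped = int(test_models.isna().sum())

print("missing model codes:", missing_without_nan)
print("test rows without a model code:", n_unmapped)
```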


### Model training and evaluation

In [36]:
from util.DataMining import split_dataframe, split_dataframe_flex
from util.DataMining import (
    RandomForestMining,
    RandomForestMiningByModel,
    GradientBoostingMining,
    LinearRegressionMining,
    LinearRegressionMiningByModel,
    CombinedDataMiningRandomForestAndLinearRegression
)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [37]:
# run_times, rmse_sum = 5, 0
# for i in tqdm(range(run_times), desc='Running Random Forest'):
#     target_col = 'price'
#     x_train, x_test, y_train, y_test = split_dataframe(df, target_col)
#     rmse_sum += RandomForestMining(x_train, x_test, y_train, y_test)
# print('Average RMSE:', round(rmse_sum / run_times))

In [72]:
run_times, rmse_sum = 5, 0
for i in tqdm(range(run_times), desc='Running Random Forest'):
    train_drop_cols = ['price']
    test_cols = ['price', 'model']
    x_train, x_test, y_train, y_test = split_dataframe_flex(df_aug, train_drop_cols, test_cols)
    rmse_sum += RandomForestMiningByModel(x_train, x_test, y_train, y_test)
print('Average RMSE:', round(rmse_sum / run_times))

  y_pred_df = pd.concat([y_pred_df, temp_df], ignore_index=True)
Running Random Forest:  20%|█████████████████▌                                                                      | 1/5 [00:43<02:54, 43.66s/it]

   model  Prediction
0  731.0   161254.76
1  564.0   285340.00
2  372.0    53411.24
3  533.0    20509.92
4  483.0    77500.00
          price  model
19080  155800.0    731
17256  285000.0    564
2651    58000.0    372
13527   14800.0    533
20790   77500.0    483
Data saved to results.csv
Running not in develop mode
RMSE on test data: 13094.611381350667


Running Random Forest:  40%|███████████████████████████████████▏                                                    | 2/5 [01:27<02:11, 43.68s/it]

   model     Prediction
0  705.0   37801.015231
1  705.0   48061.636598
2  705.0   23247.184314
3  232.0  118699.120000
4  649.0   58888.000000
          price  model
23487   38000.0    705
15680   46000.0    705
19218   22800.0    705
33767  118888.0    232
28798   58888.0    649
Data saved to results.csv
Running not in develop mode
RMSE on test data: 16004.42931582649


Running Random Forest:  60%|████████████████████████████████████████████████████▊                                   | 3/5 [02:11<01:27, 43.76s/it]

   model  Prediction
0  752.0   270785.00
1  116.0    68557.37
2  454.0   139800.00
3   78.0   180000.00
4  610.0   202800.00
          price  model
38411  270800.0    752
16130   62800.0    116
25632  139800.0    454
27001  180000.0     78
33290  202800.0    610
Data saved to results.csv
Running not in develop mode
RMSE on test data: 18052.534842244153


Running Random Forest:  80%|██████████████████████████████████████████████████████████████████████▍                 | 4/5 [02:54<00:43, 43.74s/it]

   model     Prediction
0  142.0   30668.543556
1  217.0   64000.000000
2  416.0   86652.960000
3  187.0   10600.000000
4  350.0  166270.600000
          price  model
19167   26800.0    142
28924   64000.0    217
17581   98800.0    416
25850   10600.0    187
15541  138000.0    350
Data saved to results.csv
Running not in develop mode
RMSE on test data: 13608.556815960093


Running Random Forest: 100%|████████████████████████████████████████████████████████████████████████████████████████| 5/5 [03:40<00:00, 44.03s/it]

   model     Prediction
0  224.0  149753.060000
1  416.0   42011.466714
2  752.0  270800.000000
3  510.0   18833.706667
4  121.0   55149.282167
          price  model
19801  135800.0    224
16895   42500.0    416
38397  270800.0    752
15312   18800.0    510
2914    57800.0    121
Data saved to results.csv
Running not in develop mode
RMSE on test data: 13355.431496767686
Average RMSE: 14823
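
For reference, the RMSE and the reported average can be reproduced by hand. The sketch below defines RMSE in plain Python, applies it to the five price/prediction pairs printed for the first run, and averages the five per-run RMSE values from the output above. The helper `rmse` is written here for illustration and is not the function used inside `util/DataMining.py`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# The five price/prediction pairs printed for the first run.
y_true = [155800.0, 285000.0, 58000.0, 14800.0, 77500.0]
y_pred = [161254.76, 285340.00, 53411.24, 20509.92, 77500.00]
print("RMSE on these five rows:", round(rmse(y_true, y_pred)))

# Per-run RMSE values reported in the output above.
run_rmses = [
    13094.611381350667,
    16004.42931582649,
    18052.534842244153,
    13608.556815960093,
    13355.431496767686,
]
average_rmse = round(sum(run_rmses) / len(run_rmses))
print("Average RMSE:", average_rmse)
```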





In [39]:
# x_train, y_train = df.drop(columns=['price']), df['price']
# x_test = df_test[x_train.columns]

# res = RandomForestMining(x_train, x_test, y_train, dev=True)
# res.to_csv('./data/res.csv', index=False)

In [41]:
x_train, y_train = df.drop(columns=['price']), df[['price', 'model']]
x_test = df_test[x_train.columns]
print(df_test.head())

res = RandomForestMiningByModel(x_train, x_test, y_train, dev=True)
res.to_csv('./data/res_by_model.csv', index=False)

   model   mileage  low mileage car  manufactured  reg_year  type_of_vehicle  \
0  705.0  112000.0                0        2015.0      2015                8   
1   36.0  120000.0                1        2007.0      2007                3   
2  221.0   43000.0                0        2019.0      2020                6   
3  707.0   53300.0                0        2019.0      2019                3   
4   36.0  149000.0                0        2015.0      2015                1   

   dereg_value  depreciation  power    coe      arf      omv  road_tax  \
0       9582.0       17660.0   96.0  57199   9229.0  19229.0     682.0   
1      13644.0       10920.0   79.0  42564  15782.0  14347.0    1113.0   
2      54818.0       22120.0  141.0  32801  47809.0  39863.0    1210.0   
3      26363.0       13700.0   79.0  29159  15573.0  15573.0     682.0   
4      15197.0       14190.0   88.0  56001  13097.0  18097.0     682.0   

   almost new car  coe car  parf car  
0               0        0         



KeyError: np.float64(nan)
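
The `KeyError: np.float64(nan)` is consistent with a per-model lookup reaching one of the 98 test rows whose model is NaN. The internals of `RandomForestMiningByModel` are not shown in this notebook, so the sketch below only illustrates the general pattern of falling back to a global prediction when a key is unseen. It uses the mean price per model as a stand-in for per-model regressors (this is not a random forest), and all values are made up.

```python
import numpy as np
import pandas as pd

# Made-up training data with encoded model codes.
train = pd.DataFrame({
    "model": [705.0, 705.0, 36.0],
    "price": [38000.0, 46000.0, 22800.0],
})
# Made-up test data; the second row has no model code.
test = pd.DataFrame({"model": [705.0, np.nan, 36.0]})

# Stand-in "per-model predictors": the mean price of each model.
per_model_price = train.groupby("model")["price"].mean()

# Global fallback used when the model code is missing or unseen.
global_price = train["price"].mean()

# Series.map leaves keys that are absent from the lookup (including NaN)
# as NaN, so the fallback can be filled in instead of raising KeyError.
test["prediction"] = test["model"].map(per_model_price).fillna(global_price)
print(test)
```

With real per-model regressors, the same idea would be to check whether the row's model code has a fitted regressor and otherwise route the row to a regressor trained on all data.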