<img src="images/img.png" />

# CS5228 Project, Group 32

In [20]:
# Auto reload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [21]:
import os
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder

## Train data preprocess
In this part, we are going to perform some data preprocessing steps. This may include:
* Data cleaning: handle missing values, duplicates, inconsistent or invalid values, outliers

* Data reduction: reduce number of attributes, reduce number of attribute values

* Data transformation: attribute construction, normalization

* Data encoding: convert categorical attributes to numerical attributes

### Load the train dataset

In [22]:
df = pd.read_csv('./data/train.csv')

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 25000 data points in training data, each with 30 attributes.


### Data Reduction

Before data cleaning, remove the attributes that are known not to be meaningful to our prediction model:
  * Meaningless identifier: listing_id
  * Attributes in free text: description, features, accessories. title is kept for now because it is used later to fill missing values for make
  * Attributes with the same value in every row: eco_category, indicative_price
  * Attributes with only 2 values: transmission
  * Attributes unlikely to affect price: curb_weight

We drop all of these columns in one step.

In [23]:
columns_to_drop = [
    'listing_id', 

    'description',
    'features',
    'accessories',

    'eco_category', 
    'indicative_price',
    'transmission',

    'curb_weight',
]

df = df.drop(columns=columns_to_drop)

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 25000 data points in training data, each with 22 attributes.
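
Constant and two-valued attributes like the ones listed above can also be found programmatically. The following is a minimal sketch on a toy DataFrame; the helper name `summarize_cardinality` and the toy values are assumptions made for illustration and are not part of the project code.

```python
import pandas as pd


def summarize_cardinality(df: pd.DataFrame) -> pd.Series:
    """Return the number of distinct values per column, counting NaN as a value."""
    return df.nunique(dropna=False).sort_values()


# Toy data standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    'listing_id': [1, 2, 3, 4],
    'eco_category': ['uncategorized'] * 4,
    'transmission': ['auto', 'manual', 'auto', 'auto'],
    'price': [50000, 72000, 31000, 88000],
})

cardinality = summarize_cardinality(toy)
constant_columns = cardinality[cardinality == 1].index.tolist()
binary_columns = cardinality[cardinality == 2].index.tolist()
print(constant_columns)
print(binary_columns)
```

A column whose count equals the number of rows, such as `listing_id` here, is typically an identifier and carries no predictive signal.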


### Data Cleaning
First, for each column, count the rows with NaN values.
There are 2 scenarios:
1. NaN values are the majority (e.g. fuel_type has 19121 rows with NaN values): we remove the corresponding attribute.
2. NaN values are the minority: we can either fill them in or delete the affected data points.

In [24]:
nan_counts = df.isna().sum()

for column, count in nan_counts.items():
    print(f"Column '{column}' has {count} rows with NaN values.")

Column 'title' has 0 rows with NaN values.
Column 'make' has 1316 rows with NaN values.
Column 'model' has 0 rows with NaN values.
Column 'manufactured' has 7 rows with NaN values.
Column 'original_reg_date' has 24745 rows with NaN values.
Column 'reg_date' has 0 rows with NaN values.
Column 'type_of_vehicle' has 0 rows with NaN values.
Column 'category' has 0 rows with NaN values.
Column 'power' has 2640 rows with NaN values.
Column 'fuel_type' has 19121 rows with NaN values.
Column 'engine_cap' has 596 rows with NaN values.
Column 'no_of_owners' has 18 rows with NaN values.
Column 'depreciation' has 507 rows with NaN values.
Column 'coe' has 0 rows with NaN values.
Column 'road_tax' has 2632 rows with NaN values.
Column 'dereg_value' has 220 rows with NaN values.
Column 'mileage' has 5304 rows with NaN values.
Column 'omv' has 64 rows with NaN values.
Column 'arf' has 174 rows with NaN values.
Column 'opc_scheme' has 24838 rows with NaN values.
Column 'lifespan' has 22671 rows with NaN values.

We first drop the columns in which most values are NaN, since they are unlikely to help prediction.

In [25]:
columns_to_drop_nan = [
    'fuel_type',
    'opc_scheme',
    'original_reg_date',
    'lifespan',
]

for col in columns_to_drop_nan:
    if col in df.columns:
        df = df.drop(columns=[col])
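
The notebook picks these columns by hand from the counts printed above. As a hedged alternative, the same selection can be expressed as a threshold on the fraction of missing values. The helper `columns_mostly_nan` and the 0.5 threshold below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def columns_mostly_nan(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """List the columns whose fraction of NaN values exceeds `threshold`."""
    nan_fraction = df.isna().mean()
    return nan_fraction[nan_fraction > threshold].index.tolist()


# Toy data (made-up values) with one mostly-missing column.
toy = pd.DataFrame({
    'fuel_type': [np.nan, np.nan, np.nan, 'diesel'],   # 75% missing
    'power': [110.0, np.nan, 96.0, 135.0],             # 25% missing
    'coe': [48011, 47002, 50355, 27571],               # complete
})

to_drop = columns_mostly_nan(toy, threshold=0.5)
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

With the counts reported above, the four dropped columns are each missing in more than half of the 25000 rows, while `mileage` (5304 missing rows) stays well below that share.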

### Data transformation

Introduce a new attribute `car_age`, calculated as the current year minus the year of `reg_date`.

In [26]:
from util.DataPreprocess import CalculateCarAge
    
df = CalculateCarAge(df)
num_records, num_attributes = df.shape
print("There are {} data points, each with {} attributes.". format(num_records, num_attributes))

There are 25000 data points, each with 19 attributes.
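
The implementation of `CalculateCarAge` is not shown in this notebook. A minimal sketch of the calculation described above might look as follows. The function name `add_car_age`, the `reference_year` parameter, and the `%Y-%m-%d` date format are assumptions; the real `reg_date` format may differ.

```python
from datetime import datetime
from typing import Optional

import pandas as pd


def add_car_age(df: pd.DataFrame, reference_year: Optional[int] = None) -> pd.DataFrame:
    """Replace `reg_date` with `reg_year` and `car_age` (in whole years)."""
    if reference_year is None:
        reference_year = datetime.now().year
    out = df.copy()
    # Assumed date format; adjust `format` to match the real data.
    reg_dates = pd.to_datetime(out['reg_date'], format='%Y-%m-%d')
    out['reg_year'] = reg_dates.dt.year
    out['car_age'] = reference_year - out['reg_year']
    return out.drop(columns=['reg_date'])


toy = pd.DataFrame({'reg_date': ['2018-03-09', '2007-11-30']})
aged = add_car_age(toy, reference_year=2024)
print(aged)
```

Passing a fixed `reference_year` keeps the feature reproducible, whereas relying on the current year makes the values depend on when the notebook is run.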


Fill missing values in `make` from `title`, since the make is always mentioned in the title.

In [27]:
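# Collect the known makes that literally appear in their own listing title.
# A dict keeps first-seen order while discarding duplicates.
known_makes = {}
for _, row in df.iterrows():
    if pd.notna(row['make']) and row['make'] in row['title'].lower():
        known_makes[row['make']] = None

# For rows with a missing make, use the first known make found in the title.
for index, row in df.iterrows():
    if pd.isna(row['make']):
        title = row['title'].lower()
        for make in known_makes:
            if make in title:
                df.at[index, 'make'] = make.lower()
                break

# The title is no longer needed once make has been filled.
df = df.drop(columns=['title'])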

Transform categorical data to numerical data. We use `LabelEncoder` to encode each categorical column as integers, and save the label-to-code mapping so that the test data can be encoded with the same codes.

In [None]:
categorical_columns = [
    'model',
    'make',
    'type_of_vehicle',
]

encode_dict = {}
le = LabelEncoder()
for column in categorical_columns:
    df[column] = le.fit_transform(df[column])
    encode_dict[column] = {str(label): int(index) for index, label in enumerate(le.classes_)}

with open('./data/encode.json', 'w') as file:
    json.dump(encode_dict, file, indent=4)
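
To make the shape of the saved mapping concrete, here is a small sketch that uses pandas alone. `pd.factorize(..., sort=True)` is swapped in for `LabelEncoder` here; both assign codes in sorted label order. The helper `build_label_mapping` and the toy values are assumptions for illustration.

```python
import json

import pandas as pd


def build_label_mapping(values: pd.Series) -> dict:
    """Map each distinct label to an integer code in sorted label order."""
    _, labels = pd.factorize(values, sort=True)
    return {str(label): int(code) for code, label in enumerate(labels)}


toy = pd.DataFrame({'make': ['toyota', 'bmw', 'honda', 'bmw']})
mapping = {'make': build_label_mapping(toy['make'])}

# Round-trip through JSON, mirroring how encode.json is written and read back.
restored = json.loads(json.dumps(mapping))
toy['make'] = toy['make'].map(restored['make'])
print(restored)
print(toy['make'].tolist())
```

Because the mapping survives the JSON round trip unchanged, any later dataset can be encoded with exactly the codes used for training.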

Transform the comma-separated categories in `category` using one-hot encoding.

In [29]:
from util.DataPreprocess import HandlingCategoryAttribute

if 'category' in df.columns:
    df = HandlingCategoryAttribute(df)
    
print(df.columns)

Number of unique categories: 15
Unique categories: {'vintage cars', 'electric cars', 'premium ad car', 'low mileage car', 'coe car', 'sgcarmart warranty cars', 'direct owner sale', 'opc car', 'imported used vehicle', 'parf car', 'consignment car', 'almost new car', 'hybrid cars', 'sta evaluated car', 'rare & exotic'}
There are 25000 data points, each with 32 attributes.
Index(['make', 'model', 'manufactured', 'type_of_vehicle', 'power',
       'engine_cap', 'no_of_owners', 'depreciation', 'coe', 'road_tax',
       'dereg_value', 'mileage', 'omv', 'arf', 'price', 'reg_year', 'car_age',
       'almost new car', 'coe car', 'consignment car', 'direct owner sale',
       'electric cars', 'hybrid cars', 'imported used vehicle',
       'low mileage car', 'opc car', 'parf car', 'premium ad car',
       'rare & exotic', 'sgcarmart warranty cars', 'sta evaluated car',
       'vintage cars'],
      dtype='object')
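
`HandlingCategoryAttribute` lives in the project's `util` module and is not shown here. One possible way to one-hot encode a comma-separated column is sketched below; the function name `one_hot_categories` and the toy rows are assumptions, not the project's implementation.

```python
import pandas as pd


def one_hot_categories(df: pd.DataFrame, column: str = 'category') -> pd.DataFrame:
    """Expand a comma-separated column into one 0/1 indicator column per token."""
    # One row per token, keeping the original row index.
    tokens = df[column].fillna('').str.split(',').explode().str.strip()
    tokens = tokens[tokens != '']
    # Collapse the per-token indicators back to one row per original record.
    indicators = pd.get_dummies(tokens, dtype=int).groupby(level=0).max()
    indicators = indicators.reindex(df.index, fill_value=0)
    return pd.concat([df.drop(columns=[column]), indicators], axis=1)


toy = pd.DataFrame({
    'make': ['bmw', 'toyota', 'honda'],
    'category': ['parf car, premium ad car', 'coe car', None],
})
encoded = one_hot_categories(toy)
print(encoded)
```

Rows with no category at all receive zeros in every indicator column rather than NaN.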


In [30]:
total_nulls = df.isnull().sum()
print(total_nulls)

make                          0
model                         0
manufactured                  7
type_of_vehicle               0
power                      2640
engine_cap                  596
no_of_owners                 18
depreciation                507
coe                           0
road_tax                   2632
dereg_value                 220
mileage                    5304
omv                          64
arf                         174
price                         0
reg_year                      0
car_age                       0
almost new car                0
coe car                       0
consignment car               0
direct owner sale             0
electric cars                 0
hybrid cars                   0
imported used vehicle         0
low mileage car               0
opc car                       0
parf car                      0
premium ad car                0
rare & exotic                 0
sgcarmart warranty cars       0
sta evaluated car             0
vintage 

### Fill the remaining missing values

In [31]:
from util.DataPreprocess import HandlingMissingValue
from util.DataPreprocess import HandlingMissingValueWithImpute

columns = df.columns
print(df.columns)

df = HandlingMissingValue(df)
df = HandlingMissingValueWithImpute(df, columns)

total_nulls = df.isnull().sum()
print(total_nulls)

Index(['make', 'model', 'manufactured', 'type_of_vehicle', 'power',
       'engine_cap', 'no_of_owners', 'depreciation', 'coe', 'road_tax',
       'dereg_value', 'mileage', 'omv', 'arf', 'price', 'reg_year', 'car_age',
       'almost new car', 'coe car', 'consignment car', 'direct owner sale',
       'electric cars', 'hybrid cars', 'imported used vehicle',
       'low mileage car', 'opc car', 'parf car', 'premium ad car',
       'rare & exotic', 'sgcarmart warranty cars', 'sta evaluated car',
       'vintage cars'],
      dtype='object')
NaN values after handling:  0
make                       0
model                      0
manufactured               0
type_of_vehicle            0
power                      0
engine_cap                 0
no_of_owners               0
depreciation               0
coe                        0
road_tax                   0
dereg_value                0
mileage                    0
omv                        0
arf                        0
price               
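
The helpers `HandlingMissingValue` and `HandlingMissingValueWithImpute` are defined elsewhere, so their exact strategy is not visible in this notebook. As a simple stand-in, the sketch below fills each numeric column with its median. The function `fill_numeric_with_median` and the choice of the median are assumptions, and the project may well use a different imputer.

```python
import numpy as np
import pandas as pd


def fill_numeric_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaN in every numeric column with that column's median."""
    out = df.copy()
    numeric_columns = out.select_dtypes(include='number').columns
    medians = out[numeric_columns].median()
    out[numeric_columns] = out[numeric_columns].fillna(medians)
    return out


# Toy data with made-up values.
toy = pd.DataFrame({
    'power': [110.0, np.nan, 96.0, 135.0],
    'mileage': [np.nan, 42000.0, 80000.0, np.nan],
})
filled = fill_numeric_with_median(toy)
print(filled)
```

The median is less sensitive to extreme values than the mean, which matters for skewed attributes such as prices and mileage.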

In [32]:
df.head()

| | make | model | manufactured | type_of_vehicle | power | engine_cap | no_of_owners | depreciation | coe | road_tax | ... | hybrid cars | imported used vehicle | low mileage car | opc car | parf car | premium ad car | rare & exotic | sgcarmart warranty cars | sta evaluated car | vintage cars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43.0 | 595.0 | 2018.0 | 8.0 | 280.0 | 2995.0 | 2.0 | 34270.0 | 48011.0 | 2380.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 51.0 | 192.0 | 2017.0 | 2.0 | 135.0 | 1991.0 | 2.0 | 21170.0 | 47002.0 | 1202.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 29.0 | 546.0 | 2007.0 | 4.0 | 118.0 | 2354.0 | 3.0 | 12520.0 | 50355.0 | 2442.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 88.0 | 156.0 | 2008.0 | 3.0 | 80.0 | 1598.0 | 3.0 | 10140.0 | 27571.0 | 1113.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 44.0 | 398.0 | 2006.0 | 2.0 | 183.0 | 2995.0 | 6.0 | 13690.0 | 48479.0 | 3570.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Remove Exact Duplicates
We remove duplicated data points here.

In [33]:
df = df.drop_duplicates()

num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

There are 24421 data points in training data, each with 32 attributes.


### Introduce new attributes `omv_arf_ratio` and `dereg_coe_ratio`

In [34]:
from util.DataPreprocess import DataCalculation

df = DataCalculation(df)
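
`DataCalculation` is not shown in this notebook. Judging only by the attribute names, the two new features are plausibly `omv / arf` and `dereg_value / coe`. The sketch below encodes that reading; the direction of each ratio, the helper name `add_ratio_features`, and the handling of zero denominators are all assumptions.

```python
import numpy as np
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add omv/arf and dereg_value/coe ratios; zero denominators yield NaN."""
    out = df.copy()
    out['omv_arf_ratio'] = out['omv'] / out['arf'].replace(0, np.nan)
    out['dereg_coe_ratio'] = out['dereg_value'] / out['coe'].replace(0, np.nan)
    return out


# Toy data with made-up values, including one zero denominator.
toy = pd.DataFrame({
    'omv': [20000.0, 30000.0],
    'arf': [10000.0, 0.0],
    'dereg_value': [5000.0, 12000.0],
    'coe': [50000.0, 48000.0],
})
with_ratios = add_ratio_features(toy)
print(with_ratios[['omv_arf_ratio', 'dereg_coe_ratio']])
```

Replacing zero denominators with NaN avoids infinite values, which would otherwise have to be cleaned up before model training.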

### Save the preprocessed data

In [35]:
file_name = './data/train_preprocessed_impute.csv'

if os.path.exists(file_name):
    os.remove(file_name)
    print(f"Existing file '{file_name}' has been deleted.")

df.to_csv(file_name, index=False)
print(f"DataFrame has been saved to '{file_name}'.")

Existing file './data/train_preprocessed_impute.csv' has been deleted.
DataFrame has been saved to './data/train_preprocessed_impute.csv'.


### Load preprocessed training data

In [None]:
training_file = './data/train_preprocessed_impute.csv'
df = pd.read_csv(training_file)

columns_to_keep = [col for col in df.columns if col != 'price']

print(df.columns)
num_records, num_attributes = df.shape
print("There are {} data points in training data, each with {} attributes.". format(num_records, num_attributes))

Index(['make', 'model', 'manufactured', 'type_of_vehicle', 'power',
       'engine_cap', 'no_of_owners', 'depreciation', 'coe', 'road_tax',
       'dereg_value', 'mileage', 'omv', 'arf', 'price', 'reg_year', 'car_age',
       'almost new car', 'coe car', 'consignment car', 'direct owner sale',
       'electric cars', 'hybrid cars', 'imported used vehicle',
       'low mileage car', 'opc car', 'parf car', 'premium ad car',
       'rare & exotic', 'sgcarmart warranty cars', 'sta evaluated car',
       'vintage cars', 'omv_arf_ratio', 'dereg_coe_ratio'],
      dtype='object')
There are 24421 data points in training data, each with 34 attributes.


### Data Augmentation

Copy rows from groups that have fewer than 5 samples. The augmented data is used only for validation.

In [37]:
from util.DataPreprocess import DataAugmentation

df_aug = DataAugmentation(df)

num_records, num_attributes = df_aug.shape
print("There are {} data points after augmentation, each with {} attributes.". format(num_records, num_attributes))

There are 40705 data points after augmentation, each with 34 attributes.
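
The grouping key and the copy rule used by `DataAugmentation` are not visible here. As a rough illustration of oversampling small groups, the sketch below repeats rows until every group has at least `min_samples` rows. The helper `oversample_small_groups`, the `model` grouping column, and the cycling rule are assumptions rather than the project's actual method.

```python
import pandas as pd


def oversample_small_groups(df: pd.DataFrame, group_column: str,
                            min_samples: int = 5) -> pd.DataFrame:
    """Append copies of rows from groups with fewer than `min_samples` rows."""
    pieces = [df]
    for _, group in df.groupby(group_column):
        shortfall = min_samples - len(group)
        if shortfall > 0:
            # Cycle through the group's rows until the shortfall is covered.
            positions = [i % len(group) for i in range(shortfall)]
            pieces.append(group.iloc[positions])
    return pd.concat(pieces, ignore_index=True)


# Toy data: group 1 has 5 rows, group 2 has 2 rows, group 3 has 1 row.
toy = pd.DataFrame({
    'model': [1, 1, 1, 1, 1, 2, 2, 3],
    'price': [10, 11, 12, 13, 14, 20, 21, 30],
})
augmented = oversample_small_groups(toy, group_column='model', min_samples=5)
print(augmented['model'].value_counts().sort_index())
```

Since the added rows are exact copies, they add no new information.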


### Save the augmented data

In [38]:
file_name = './data/train_preprocessed_augmentation.csv'

if os.path.exists(file_name):
    os.remove(file_name)
    print(f"Existing file '{file_name}' has been deleted.")

df_aug.to_csv(file_name, index=False)
print(f"DataFrame has been saved to '{file_name}'.")

Existing file './data/train_preprocessed_augmentation.csv' has been deleted.
DataFrame has been saved to './data/train_preprocessed_augmentation.csv'.


## Preprocess test data

Load the test data and preprocess it. The preprocessed training data is loaded as well, because it serves as the reference when imputing missing values in the test data.

In [51]:
train_file = './data/train_preprocessed_impute.csv'
df = pd.read_csv(train_file)

test_file = './data/test.csv'
df_test = pd.read_csv(test_file)
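
One reason to keep the training data at hand is that statistics used to fill test-set gaps should come from the training set, so both sets are treated consistently. The sketch below illustrates the idea with medians. The helper `impute_from_reference` and the median strategy are assumptions; the project's `HandlingMissingValueWithImputeReference` may work differently.

```python
import numpy as np
import pandas as pd


def impute_from_reference(target: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Fill NaN in `target` using medians computed on `reference` only."""
    out = target.copy()
    shared = [c for c in out.columns if c in reference.columns]
    medians = reference[shared].median(numeric_only=True)
    # Columns absent from `reference` are left untouched.
    return out.fillna(medians)


# Toy data with made-up values.
train_toy = pd.DataFrame({'power': [100.0, 120.0, 140.0], 'price': [1.0, 2.0, 3.0]})
test_toy = pd.DataFrame({'power': [np.nan, 90.0], 'mileage': [np.nan, 5000.0]})
imputed = impute_from_reference(test_toy, train_toy)
print(imputed)
```

In this toy case `mileage` stays missing because the reference frame has no such column, which makes gaps in the reference data easy to spot.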

### Drop columns

In [52]:
columns_to_drop = [
    'listing_id', 
    'description',
    'features',
    'accessories',
    'eco_category',   
    'indicative_price',
    'curb_weight',       
    'transmission',
    'original_reg_date',
    'lifespan',
]

df_test = df_test.drop(columns=columns_to_drop)

num_records, num_attributes = df_test.shape
print("There are {} data points in test data, each with {} attributes.". format(num_records, num_attributes))

There are 10000 data points in test data, each with 19 attributes.


### Calculate car age

In [53]:
from util.DataPreprocess import CalculateCarAge
df_test = CalculateCarAge(df_test)

num_records, num_attributes = df_test.shape
print("There are {} data points, each with {} attributes.". format(num_records, num_attributes))

There are 10000 data points, each with 20 attributes.


### Convert category data

In [54]:
from util.DataPreprocess import HandlingCategoryAttribute

if 'category' in df_test.columns:
    df_test = HandlingCategoryAttribute(df_test)

Number of unique categories: 15
Unique categories: {'vintage cars', 'electric cars', 'premium ad car', 'sgcarmart warranty cars', 'coe car', 'low mileage car', 'direct owner sale', 'opc car', 'imported used vehicle', 'parf car', 'almost new car', 'consignment car', 'hybrid cars', 'sta evaluated car', 'rare & exotic'}
There are 10000 data points, each with 34 attributes.


### Handle missing values on test data similar to training data

In [55]:
columns_to_drop_nan = [
    'fuel_type',
    'opc_scheme',
]

for col in columns_to_drop_nan:
    if col in df_test.columns:
        df_test = df_test.drop(columns=[col])

In [56]:
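# Collect the known makes that literally appear in their own listing title.
# A dict keeps first-seen order while discarding duplicates.
known_makes = {}
for _, row in df_test.iterrows():
    if pd.notna(row['make']) and row['make'] in row['title'].lower():
        known_makes[row['make']] = None

# For rows with a missing make, use the first known make found in the title.
for index, row in df_test.iterrows():
    if pd.isna(row['make']):
        title = row['title'].lower()
        for make in known_makes:
            if make in title:
                df_test.at[index, 'make'] = make.lower()
                break

# The title is no longer needed once make has been filled.
df_test = df_test.drop(columns=['title'])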

In [59]:
from util.DataPreprocess import DataCalculation

df_test = DataCalculation(df_test)

### Encode attributes on test data

In [60]:
num_records, num_attributes = df_test.shape
print("There are {} data points, each with {} attributes.". format(num_records, num_attributes))

categorical_columns = [
    'make',
    'model',
    'type_of_vehicle'
]

with open('./data/encode.json', 'r') as file:
    data = json.load(file)

new_encodings = {}

for col, cate_dict in data.items():
    if col in df_test.columns and col in categorical_columns:
        original_values = df_test[col].copy()
        
        df_test[col] = df_test[col].map(cate_dict)
        
        missing = df_test[col][df_test[col].isna()]
        if not missing.empty:
            original_missing_values = original_values[missing.index]
            unique_missing_values = original_missing_values.unique()
            new_encoding_dict = {val: idx for idx, val in enumerate(unique_missing_values, start=max(cate_dict.values()) + 1)}
            new_encodings[col] = new_encoding_dict
            
            df_test[col] = df_test[col].fillna(original_missing_values.map(new_encoding_dict))

if new_encodings:
    print("New encodings created for missing values:")
    for col, encodings in new_encodings.items():
        print(f"Column: {col}, New encodings: {encodings}")
else:
    print("No new encodings were necessary.")

print(df_test.head())

num_records, num_attributes = df_test.shape
print(f"There are {num_records} data points in test data, each with {num_attributes} attributes.")

There are 10000 data points, each with 33 attributes.
New encodings created for missing values:
Column: model, New encodings: {'fd7jjma': 799, 'kluger': 800, 'phaeton': 801, 'tourer': 802, 'lt434p': 803, 'a160': 804, 'cx-7': 805, 's350d': 806, 'colorado': 807, 'slyphy': 808, 'vanguard': 809, 'princess': 810, 'bb': 811, 'ya': 812, 'fd7jpma': 813, 'e350': 814, 'e-type': 815, 'fvr90': 816, 'ev': 817, 'cwb45a': 818, '260e': 819, 'clk230': 820, 'eletre': 821, 'clk280': 822, 'genesis': 823, 'p5b': 824, 'artura': 825, 'trajet': 826, 'c350': 827, 'sierra': 828, 'patrol': 829, '350sl': 830, 'biturbo': 831, 'flh290': 832, 'meriva': 833, 'tarraco': 834, 'cyz52r': 835, 'seven': 836, 'fs1elkd': 837, '924': 838, 'k94ib4x2': 839, 'daewoo': 840, 'sl500': 841, 'xml6772': 842, 'captiva': 843, '3336k': 844, 'td': 845, 'b7r': 846, 'arnage': 847, 'a170': 848, '3000gt': 849, 'tong': 850, 'gh8jrka': 851, 'brooklands': 852, '9-5': 853, '348': 854, 'i40': 855, '2600': 856, 'exige': 857, '356b': 858, 'safari': 



In [61]:
from util.DataPreprocess import HandlingMissingValueWithImputeReference
from util.DataPreprocess import HandlingMissingValueTest

columns = df_test.columns
print(df.columns)

df_test = HandlingMissingValueTest(df_test)
df_test = HandlingMissingValueWithImputeReference(df_test, df, columns)

total_nulls = df_test.isnull().sum()
print(total_nulls)

df_test.to_csv('./data/test_preprocessed.csv')

Index(['make', 'model', 'manufactured', 'type_of_vehicle', 'power',
       'engine_cap', 'no_of_owners', 'depreciation', 'coe', 'road_tax',
       'dereg_value', 'mileage', 'omv', 'arf', 'price', 'reg_year', 'car_age',
       'almost new car', 'coe car', 'consignment car', 'direct owner sale',
       'electric cars', 'hybrid cars', 'imported used vehicle',
       'low mileage car', 'opc car', 'parf car', 'premium ad car',
       'rare & exotic', 'sgcarmart warranty cars', 'sta evaluated car',
       'vintage cars', 'omv_arf_ratio', 'dereg_coe_ratio'],
      dtype='object')
NaN values after handling:  4020




   make  model  manufactured  type_of_vehicle  power  engine_cap  \
0  29.0  746.0        2015.0              8.0   96.0      1496.0   
1  49.0   41.0        2007.0              3.0   79.0      1598.0   
2  53.0  235.0        2019.0              6.0  141.0      1998.0   
3  88.0  748.0        2019.0              3.0   79.0      1496.0   
4  49.0   41.0        2015.0              1.0   88.0      1496.0   

   no_of_owners  depreciation      coe  road_tax  ...  low mileage car  \
0           2.0       17660.0  57199.0     682.0  ...              0.0   
1           1.0       10920.0  42564.0    1113.0  ...              1.0   
2           1.0       22120.0  32801.0    1210.0  ...              0.0   
3           3.0       13700.0  29159.0     682.0  ...              0.0   
4           3.0       14190.0  56001.0     682.0  ...              0.0   

   opc car  parf car  premium ad car  rare & exotic  sgcarmart warranty cars  \
0      0.0       1.0             0.0            0.0               

In [62]:
df_test.head()

| | make | model | manufactured | type_of_vehicle | power | engine_cap | no_of_owners | depreciation | coe | road_tax | ... | low mileage car | opc car | parf car | premium ad car | rare & exotic | sgcarmart warranty cars | sta evaluated car | vintage cars | omv_arf_ratio | dereg_coe_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.0 | 746.0 | 2015.0 | 8.0 | 96.0 | 1496.0 | 2.0 | 17660.0 | 57199.0 | 682.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.083541 | 0.16752 |
| 1 | 49.0 | 41.0 | 2007.0 | 3.0 | 79.0 | 1598.0 | 1.0 | 10920.0 | 42564.0 | 1113.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.909074 | 0.320553 |
| 2 | 53.0 | 235.0 | 2019.0 | 6.0 | 141.0 | 1998.0 | 1.0 | 22120.0 | 32801.0 | 1210.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.833797 | 1.67123 |
| 3 | 88.0 | 748.0 | 2019.0 | 3.0 | 79.0 | 1496.0 | 3.0 | 13700.0 | 29159.0 | 682.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.904112 |
| 4 | 49.0 | 41.0 | 2015.0 | 1.0 | 88.0 | 1496.0 | 3.0 | 14190.0 | 56001.0 | 682.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.381767 | 0.27137 |


In [63]:
len(df_test)

10000