<img src="images/img.png" />

# CS5228 Project, Group 32

## Data Preprocessing
In this part, we perform the data preprocessing steps. These may include:
* Data cleaning: handle missing values, duplicates, inconsistent or invalid values, and outliers
* Data reduction: reduce the number of attributes and the number of attribute values
* Data transformation: attribute construction, normalization
* Data discretization: encode categorical attributes as numerical values

### Setting up the Notebook

In [1]:
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load file into pandas dataframe
df = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')

num_records, num_attributes = df.shape
print(f"There are {num_records} data points in training data, each with {num_attributes} attributes.")
num_records, num_attributes = df_test.shape
print(f"There are {num_records} data points in test data, each with {num_attributes} attributes.")

There are 25000 data points in training data, each with 30 attributes.
There are 10000 data points in test data, each with 29 attributes.


### Data Cleaning

Before cleaning the data, we remove the attributes that are known not to be meaningful for our prediction model:
  * Meaningless identifier: listing_id
  * Free-text attributes: title, description, features, accessories
  * Attributes with a single constant value: eco_category, indicative_price
  * Attribute unlikely to affect price: curb_weight
  * Also dropped in the cell below: original_reg_date, lifespan

In [3]:
columns_to_drop = [
    'listing_id',          # Meaningless identifier
    'title',               # Attributes in free text
    'description',
    'features',
    'accessories',
    'eco_category',        # Attribute with the same value
    'indicative_price',
    'curb_weight',         # Attribute unlikely to affect price

    'original_reg_date',
    'lifespan',
]

df = df.drop(columns=columns_to_drop)

num_records, num_attributes = df.shape
print(f"There are {num_records} data points, each with {num_attributes} attributes.")

There are 25000 data points, each with 20 attributes.


### Handle Missing Values
First, for each column that may contain missing values, we count the rows with NaN values.
There are 2 scenarios:
1. NaN values are the majority (e.g. fuel_type has 19121 rows with NaN values): we remove the corresponding attribute.
2. NaN values are a minority: we either fill them in or delete the affected data points.

In [4]:
columns_to_check = [
    'make',
    'fuel_type',
    'manufactured',
    'power',
    'engine_cap',
    'mileage',
    'no_of_owners',
    'depreciation',
    'road_tax',
    'dereg_value',
    'omv',
    'arf',
    'opc_scheme'
]

# Calculate the number of NaN values in each specified column
nan_counts = df[columns_to_check].isna().sum()

# Print the number of NaN values for each column
for column, count in nan_counts.items():
    print(f"Column '{column}' has {count} rows with NaN values.")

Column 'make' has 1316 rows with NaN values.
Column 'fuel_type' has 19121 rows with NaN values.
Column 'manufactured' has 7 rows with NaN values.
Column 'power' has 2640 rows with NaN values.
Column 'engine_cap' has 596 rows with NaN values.
Column 'mileage' has 5304 rows with NaN values.
Column 'no_of_owners' has 18 rows with NaN values.
Column 'depreciation' has 507 rows with NaN values.
Column 'road_tax' has 2632 rows with NaN values.
Column 'dereg_value' has 220 rows with NaN values.
Column 'omv' has 64 rows with NaN values.
Column 'arf' has 174 rows with NaN values.
Column 'opc_scheme' has 24838 rows with NaN values.
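
The counts above are easier to compare as fractions of the table. The following is a minimal sketch on a toy frame, assuming a hypothetical cut-off of 50% (`max_nan_ratio` is not a value taken from this project), of how drop candidates could be flagged:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are illustrative only)
toy = pd.DataFrame({
    'fuel_type': [np.nan, np.nan, np.nan, 'petrol'],
    'power': [100.0, np.nan, 120.0, 90.0],
    'omv': [20000.0, 25000.0, 30000.0, 18000.0],
})

max_nan_ratio = 0.5  # hypothetical cut-off, to be tuned for the real data

# Fraction of NaN values per column, largest first
nan_ratio = toy.isna().mean().sort_values(ascending=False)

# Columns whose NaN share exceeds the cut-off are candidates for removal
drop_candidates = nan_ratio[nan_ratio > max_nan_ratio].index.tolist()
print(nan_ratio)
print("Drop candidates:", drop_candidates)
```

With the counts printed above and 25000 rows, fuel_type is missing in about 76% of the rows and opc_scheme in about 99%, while the next largest column, mileage, is missing in about 21%.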


We drop the attributes that have too many NaN values.

In [5]:
columns_to_drop_nan = [
    'fuel_type',
    'opc_scheme'
]

df = df.drop(columns=columns_to_drop_nan)

Then we fill in the remaining missing values.

In [6]:
from util.DataPreprocess import HandlingMissingValues

df = HandlingMissingValues(df)

NaN values after handling:  0
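
`HandlingMissingValues` comes from `util/DataPreprocess.py`, which is not shown in this notebook. As an illustration only, and not a description of that helper, one common strategy is to fill numeric columns with the median and categorical columns with the most frequent value. The function name `fill_missing_sketch` and the toy data are assumptions:

```python
import numpy as np
import pandas as pd

def fill_missing_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in, NOT the project's HandlingMissingValues."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # The median is less sensitive to outliers than the mean
            out[col] = out[col].fillna(out[col].median())
        else:
            # Most frequent value for categorical columns
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

toy = pd.DataFrame({
    'make': ['bmw', None, 'bmw', 'audi'],
    'power': [100.0, np.nan, 120.0, 90.0],
})
filled = fill_missing_sketch(toy)
print("NaN values after handling: ", int(filled.isna().sum().sum()))
```

If the same treatment is wanted for the test data, the fill values would normally be computed on the training data and reused, so that both frames are filled consistently.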


### Remove Exact Duplicates
We remove duplicated data points here.

In [7]:
df = df.drop_duplicates()
df_test = df_test.drop_duplicates()

num_records, num_attributes = df.shape
print(f"There are {num_records} data points, each with {num_attributes} attributes.")

There are 24258 data points, each with 18 attributes.


### Merge Rare Attribute Values

In [8]:
# Makes that occur fewer than `threshold` times (i.e. only once) are merged into 'others'
threshold = 2

value_counts = df['make'].value_counts()
categories_to_replace = value_counts[value_counts < threshold].index
df['make'] = df['make'].replace(categories_to_replace, 'others')

value_counts = df_test['make'].value_counts()
categories_to_replace = value_counts[value_counts < threshold].index
df_test['make'] = df_test['make'].replace(categories_to_replace, 'others')
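
The cell above decides which makes are rare separately for the training and the test data, so the same make can be kept in one frame and merged in the other. One possible variant, shown here as a sketch on toy data and not as the pipeline used in this notebook, derives the set of frequent makes from the training data only and applies it to both frames:

```python
import pandas as pd

# Toy columns standing in for df['make'] and df_test['make']
train_make = pd.Series(['bmw', 'bmw', 'audi', 'audi', 'lotus'])
test_make = pd.Series(['bmw', 'audi', 'lotus', 'saab'])

threshold = 2  # same threshold as in the cell above

# Makes that are frequent enough in the TRAINING data
counts = train_make.value_counts()
frequent = set(counts[counts >= threshold].index)

# Everything outside that set, including makes unseen in training, becomes 'others'
train_merged = train_make.where(train_make.isin(frequent), 'others')
test_merged = test_make.where(test_make.isin(frequent), 'others')

print(train_merged.tolist())
print(test_merged.tolist())
```

With this variant, a make that never appears in the training data also ends up in 'others', which keeps the set of possible values identical in both frames.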

### Transform categorical values to numerical values

In [9]:
categorical_columns = [
    'make',
    'model',
    'type_of_vehicle',
    'transmission',
]

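for column in categorical_columns:
    # Fit one encoder per column on the labels of both frames, so that the
    # same label receives the same code in the training and the test data
    le = LabelEncoder()
    le.fit(pd.concat([df[column], df_test[column]]).astype(str))
    df[column] = le.transform(df[column].astype(str))
    df_test[column] = le.transform(df_test[column].astype(str))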
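
Label codes are only comparable between the two frames if both frames use the same label-to-code mapping. A minimal pandas-only sketch of that idea follows, assuming toy data and a hypothetical sentinel value of -1 for labels that never occur in the training data:

```python
import pandas as pd

# Toy columns standing in for one categorical attribute
train_col = pd.Series(['audi', 'bmw', 'audi', 'others'])
test_col = pd.Series(['bmw', 'others', 'tesla'])

# Build the label -> code mapping from the training column only
classes = sorted(train_col.unique())
mapping = {label: code for code, label in enumerate(classes)}

train_encoded = train_col.map(mapping)
# Labels unseen in training map to NaN first, then to the hypothetical sentinel -1
test_encoded = test_col.map(mapping).fillna(-1).astype(int)

print(mapping)
print(train_encoded.tolist())
print(test_encoded.tolist())
```

Whether a sentinel is acceptable depends on the model. Tree-based models can usually cope with an extra integer code, but the choice should be checked against the actual data.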

### Transform date time attributes to numerical values

In [10]:
df['reg_date'] = pd.to_datetime(df['reg_date'], format='%d-%b-%Y')
df['reg_year'] = df['reg_date'].dt.year
df = df.drop(columns=['reg_date'])

num_records, num_attributes = df.shape

print(f"There are {num_records} data points, each with {num_attributes} attributes.")

There are 24258 data points, each with 18 attributes.
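
The cell above converts `reg_date` in the training frame only. If the same attribute is needed for the test data, the conversion can be wrapped in a small function and applied to both frames. This is a sketch with a hypothetical helper name (`add_reg_year`) and made-up sample dates in the `%d-%b-%Y` format used above:

```python
import pandas as pd

def add_reg_year(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace the 'reg_date' string column with a numeric 'reg_year' column."""
    out = frame.copy()
    parsed = pd.to_datetime(out['reg_date'], format='%d-%b-%Y')
    out['reg_year'] = parsed.dt.year
    return out.drop(columns=['reg_date'])

# Made-up sample rows, for illustration only
toy = pd.DataFrame({'reg_date': ['09-Dec-2021', '26-May-2011'], 'omv': [20000, 30000]})
converted = add_reg_year(toy)
print(converted)
```

Because one column is added and one is removed, the number of attributes stays the same, which matches the unchanged count of 18 in the output above.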


### Handle category attribute

In [11]:
# Replace '-' with an empty string
df['category'] = df['category'].replace('-', '')

# Split the 'category' column into lists
df['category_list'] = df['category'].str.split(', ')

# Handle empty strings by replacing them with empty lists
df['category_list'] = df['category_list'].apply(lambda x: [] if x == [''] else x)

# Import itertools for flattening lists
from itertools import chain

# Flatten the list of lists to a single list
all_categories = list(chain.from_iterable(df['category_list']))

# Get the unique categories
unique_categories = set(all_categories)

# Print the number of unique categories
print(f"Number of unique categories: {len(unique_categories)}")
print("Unique categories:", unique_categories)

# Initialize the MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit and transform the category lists
category_dummies = mlb.fit_transform(df['category_list'])

# Create a DataFrame with the one-hot encoded categories
category_df = pd.DataFrame(category_dummies, columns=mlb.classes_, index=df.index)

# Concatenate the new dummy columns to the original DataFrame
df = pd.concat([df, category_df], axis=1)

# Drop the temporary 'category_list' column if desired
df.drop('category_list', axis=1, inplace=True)
df.drop('category', axis=1, inplace=True)

num_records, num_attributes = df.shape

print(f"There are {num_records} data points, each with {num_attributes} attributes.")

Number of unique categories: 15
Unique categories: {'direct owner sale', 'parf car', 'hybrid cars', 'low mileage car', 'sgcarmart warranty cars', 'electric cars', 'imported used vehicle', 'coe car', 'sta evaluated car', 'consignment car', 'opc car', 'premium ad car', 'almost new car', 'vintage cars', 'rare & exotic'}
There are 24258 data points, each with 32 attributes.
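
The same multi-hot encoding can also be written with pandas alone. The sketch below uses `Series.str.get_dummies` on toy data and shows one way to give a second frame exactly the same indicator columns as the training frame. The sample strings are illustrative, and aligning the test frame is an assumption, since the notebook encodes `category` for the training data only:

```python
import pandas as pd

# Toy columns standing in for df['category'] and df_test['category']
train_cat = pd.Series(['parf car, premium ad car', '-', 'coe car'])
test_cat = pd.Series(['coe car, hybrid cars', 'parf car'])

# '-' marks "no category"; an empty string yields a row of zeros
train_dummies = train_cat.replace('-', '').str.get_dummies(sep=', ')

# Align the test columns to the training columns: categories unseen in
# training are dropped, categories absent from the test data are filled with 0
test_dummies = (
    test_cat.replace('-', '')
    .str.get_dummies(sep=', ')
    .reindex(columns=train_dummies.columns, fill_value=0)
)

print(train_dummies)
print(test_dummies)
```

The 15 categories found above explain the attribute count: 18 attributes, minus `category`, plus 15 indicator columns gives 32.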


### Saving the Data

In [12]:
file_name = './data/train_preprocessed.csv'
# file_name = '../data/preprocessed/test_preprocessed.csv'


# Check if the file exists
if os.path.exists(file_name):
    # Delete the file
    os.remove(file_name)
    print(f"Existing file '{file_name}' has been deleted.")

# Save the DataFrame to CSV
df.to_csv(file_name, index=False)
print(f"DataFrame has been saved to '{file_name}'.")

Existing file './data/train_preprocessed.csv' has been deleted.
DataFrame has been saved to './data/train_preprocessed.csv'.


## Data Mining

### Load the Preprocessed Data

In [13]:
# Load file into pandas dataframe, we saved our preprocessed file at path 'output_file'
output_file = './data/train_preprocessed.csv'
df = pd.read_csv(output_file)

columns_to_keep = [
    'model',
    'dereg_value',
    'arf',
    'omv',
    'depreciation',
    'power',
    'coe',
    'price',
]

df = df[columns_to_keep]
columns_to_keep = [col for col in df.columns if col != 'price']
df_test = df_test[columns_to_keep]

num_records, num_attributes = df.shape
print(f"There are {num_records} data points in training data, each with {num_attributes} attributes.")
num_records, num_attributes = df_test.shape
print(f"There are {num_records} data points in test data, each with {num_attributes} attributes.")

There are 24258 data points in training data, each with 8 attributes.
There are 10000 data points in test data, each with 7 attributes.


### Random Forest Training and Evaluation

In [14]:
from util.DataMining import split_dataframe, split_dataframe_flex
from util.DataMining import RandomForestMining
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

In [15]:
# Grid search over max_depth; running this cell takes about 30 minutes
search_grid = False
if search_grid:
    # Price is a continuous target, so a regressor scored by RMSE is used
    x_train, y_train = df.drop(columns='price'), df['price']
    param_grid = {'max_depth': range(1, 21)}
    grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5,
                               scoring='neg_root_mean_squared_error')
    grid_search.fit(x_train, y_train)
    print("Best max_depth:", grid_search.best_params_['max_depth'])
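
For a quick check of how such a search behaves before committing to the full run, the following sketch runs a small grid search on synthetic data. The data, the grid values, and the forest size are assumptions chosen to finish in seconds. It uses a regressor with an RMSE-based score because the target, price, is continuous:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(80, 3))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=80)

param_grid = {'max_depth': [2, 4, 8]}
grid_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes scores, hence the sign
)
grid_search.fit(X, y)

best_rmse = -grid_search.best_score_  # flip the sign back to a plain RMSE
print("Best max_depth:", grid_search.best_params_['max_depth'])
print("Cross-validated RMSE:", best_rmse)
```

The reported score is the mean over the cross-validation folds, so it is less sensitive to a single lucky or unlucky split than one train/test evaluation.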

In [18]:
# Run the algorithm outside development mode, i.e. train and test on the given training set only
target_col = 'price'
run_times, rmse_sum = 10, 0
for i in tqdm(range(run_times), desc='Running Random Forest'):
    x_train, x_test, y_train, y_test = split_dataframe(df, target_col)
    rmse_sum += RandomForestMining(x_train, x_test, y_train, y_test)
print('Average RMSE:', round(rmse_sum / run_times))

Running Random Forest: 100%|██████████| 10/10 [03:34<00:00, 21.49s/it]

Running not in develop mode
RMSE on test data: 26831.35310417431
RMSE on test data: 21356.468910706317
RMSE on test data: 22463.448459248386
RMSE on test data: 18703.78398895981
RMSE on test data: 37058.27806267251
RMSE on test data: 24403.455874692758
RMSE on test data: 20795.48500747697
RMSE on test data: 21637.362161735087
RMSE on test data: 22595.62584420286
RMSE on test data: 19426.57701484216
Average RMSE: 23527
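
The per-run values vary widely, so the spread is worth reporting next to the average. The sketch below defines RMSE explicitly and summarizes the ten runs. The helper `rmse` is illustrative and is not the function used inside `RandomForestMining`, and the run values are copied from the output above, rounded to two decimals:

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Per-run RMSE values from the output above (rounded to 2 decimals)
run_rmse = [26831.35, 21356.47, 22463.45, 18703.78, 37058.28,
            24403.46, 20795.49, 21637.36, 22595.63, 19426.58]

mean_rmse = statistics.mean(run_rmse)
std_rmse = statistics.stdev(run_rmse)  # sample standard deviation across runs
print(f"RMSE over {len(run_rmse)} runs: {mean_rmse:.0f} +/- {std_rmse:.0f}")
```

A large standard deviation relative to the mean suggests that a single split is not a reliable estimate, and that more runs or cross-validation may be preferable.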





In [20]:
# Run the algorithm in development mode, i.e. train on the full training set and write the final predictions
result_path = './data/res2.csv'
target_col = 'price'
X_train, X_test, y_train = df.drop(columns=target_col), df_test, df[target_col]
pred = RandomForestMining(X_train, X_test, y_train, dev=True)

# Create the output directory if needed, then save the predictions
os.makedirs(os.path.dirname(result_path), exist_ok=True)
pred.to_csv(result_path, index=False)