<img src='images/img.png' />

# CS5228 Project, Group 32

## Data Preprocessing
In this part, we perform the data preprocessing steps. These may include:
* Data cleaning: handle missing values, duplicates, inconsistent or invalid values, and outliers
* Data reduction: reduce the number of attributes and the number of attribute values
* Data transformation: attribute construction and normalization
* Data discretization: encode attributes as numerical values

### Setting up the Notebook

In [1]:
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm

In [2]:
# Load file into pandas dataframe
df = pd.read_csv('./data/train.csv')

num_records, num_attributes = df.shape

print('There are {} data points, each with {} attributes.'.format(num_records, num_attributes))

There are 25000 data points, each with 30 attributes.


### Data Cleaning

Before cleaning the data, we remove the attributes that are known not to be meaningful to our prediction model:
  * Meaningless identifier: listing_id
  * Free-text attributes: title, description, features, accessories
  * Attributes with a single constant value: eco_category, indicative_price
  * Attribute unlikely to affect price: curb_weight

The cell below additionally drops original_reg_date and lifespan.

In [3]:
columns_to_drop = [
    'listing_id',          # Meaningless identifier
    'title',               # Free-text attributes
    'description',
    'features',
    'accessories',
    'eco_category',        # Attributes with a single constant value
    'indicative_price',
    'curb_weight',         # Attribute unlikely to affect price
    'original_reg_date',   # Also dropped in this step
    'lifespan',
]

df = df.drop(columns=columns_to_drop)

num_records, num_attributes = df.shape

print('There are {} data points, each with {} attributes.'.format(num_records, num_attributes))

There are 25000 data points, each with 20 attributes.
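
The notebook does not show how the constant-valued attributes were identified. As a minimal sketch, assuming a column counts as constant when it has at most one distinct non-null value, a check could look like the following. The helper name `constant_columns` and the toy values are hypothetical and are not taken from train.csv.

```python
import pandas as pd


def constant_columns(frame):
    """Return the columns with at most one distinct non-null value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=True) <= 1]


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'eco_category': ['uncategorized'] * 4,        # one repeated value
    'indicative_price': [None, None, None, None], # entirely missing
    'price': [50000, 72000, 65000, 88000],        # varies, so it is kept
})

constant = constant_columns(toy)
print(constant)
```

An entirely missing column has zero distinct values, so it is reported alongside truly constant columns.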


### Handle Missing Values
First, for each column that may contain missing values, we count the rows with NaN values.
There are two scenarios:
1. NaN values are the majority (e.g. fuel_type has 19121 rows with NaN values): we remove the corresponding attribute.
2. NaN values are the minority: we either fill in the missing values or delete the affected data points.

In [4]:
columns_to_check = [
    'make',
    'fuel_type',
    'manufactured',
    'power',
    'engine_cap',
    'mileage',
    'no_of_owners',
    'depreciation',
    'road_tax',
    'dereg_value',
    'omv',
    'arf',
    'opc_scheme'
]

# Calculate the number of NaN values in each specified column
nan_counts = df[columns_to_check].isna().sum()

# Print the number of NaN values for each column
for column, count in nan_counts.items():
    print(f'Column "{column}" has {count} rows with NaN values.')

Column "make" has 1316 rows with NaN values.
Column "fuel_type" has 19121 rows with NaN values.
Column "manufactured" has 7 rows with NaN values.
Column "power" has 2640 rows with NaN values.
Column "engine_cap" has 596 rows with NaN values.
Column "mileage" has 5304 rows with NaN values.
Column "no_of_owners" has 18 rows with NaN values.
Column "depreciation" has 507 rows with NaN values.
Column "road_tax" has 2632 rows with NaN values.
Column "dereg_value" has 220 rows with NaN values.
Column "omv" has 64 rows with NaN values.
Column "arf" has 174 rows with NaN values.
Column "opc_scheme" has 24838 rows with NaN values.


We delete the attributes with too many NaN values here.

In [5]:
columns_to_drop_nan = [
    'fuel_type',
    'opc_scheme'
]

df = df.drop(columns=columns_to_drop_nan)
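
In the counts above, fuel_type is missing in 19121 of 25000 rows (about 76%) and opc_scheme in 24838 (about 99%), while the next highest, mileage, is missing in about 21%. The two columns were listed by hand, but the same choice could be derived from a threshold. The sketch below assumes a 50% cut-off, which is our illustrative choice and not a value stated in the notebook. The helper name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def columns_above_nan_threshold(frame, threshold=0.5):
    """Return the columns whose fraction of NaN values exceeds `threshold`."""
    nan_fraction = frame.isna().mean()  # per-column share of missing rows
    return nan_fraction[nan_fraction > threshold].index.tolist()


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'fuel_type': [np.nan, np.nan, np.nan, 'petrol'],  # 75% missing
    'mileage': [12000.0, np.nan, 54000.0, 80000.0],   # 25% missing
    'omv': [20000.0, 31000.0, 18000.0, 45000.0],      # 0% missing
})

sparse_columns = columns_above_nan_threshold(toy, threshold=0.5)
print(sparse_columns)
```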

Then we fill in the remaining missing values.

In [6]:
from util.DataPreprocess import HandlingMissingValues

df = HandlingMissingValues(df)

NaN values after handling:  0
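
The implementation of `HandlingMissingValues` lives in `util/DataPreprocess.py` and is not shown in this notebook. As a rough sketch of one common strategy, and not necessarily what the helper does, numeric columns can be filled with the column median and non-numeric columns with the most frequent value. The function name `fill_missing_values` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_missing_values(frame):
    """Fill numeric NaNs with the median and other NaNs with the mode.

    Assumes every column has at least one non-missing value.
    """
    filled = frame.copy()
    for col in filled.columns:
        if filled[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'make': ['toyota', None, 'toyota', 'bmw'],
    'power': [90.0, 110.0, np.nan, 185.0],
})

filled = fill_missing_values(toy)
print('NaN values after handling: ', int(filled.isna().sum().sum()))
```

The median is less sensitive than the mean to extreme values, which matters here because outliers are only removed in a later step.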


### Remove Exact Duplicates
We remove duplicated data points here.

In [7]:
df = df.drop_duplicates()

num_records, num_attributes = df.shape

print('There are {} data points, each with {} attributes.'.format(num_records, num_attributes))

There are 24258 data points, each with 18 attributes.
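
The record count drops from 25000 to 24258 in this step, so 742 rows were exact duplicates. To see that number before dropping anything, `duplicated()` can be used, as in this small sketch on hypothetical toy data:

```python
import pandas as pd

# Toy data (hypothetical values): the third row repeats the first
toy = pd.DataFrame({
    'make': ['toyota', 'bmw', 'toyota'],
    'price': [50000, 88000, 50000],
})

# duplicated() marks every repeat of an earlier row as True
num_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f'{num_duplicates} duplicated row(s) removed, {len(deduplicated)} remaining.')
```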


### Transform Categorical Values to Numerical Values

In [8]:
categorical_columns = [
    'make',
    'model',
    'type_of_vehicle',
    'transmission',
]

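# The same LabelEncoder is refit for every column, so after the loop
# `le` only holds the mapping of the last column ('transmission').
le = LabelEncoder()
for column in categorical_columns:
    df[column] = le.fit_transform(df[column])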
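
If the integer codes need to be mapped back to their original labels later (for example when interpreting results), one option is to keep a separate encoder per column. The sketch below is a variation on the cell above, not the code the notebook ran. The `encoders` dictionary and the toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'make': ['toyota', 'bmw', 'toyota'],
    'transmission': ['auto', 'manual', 'auto'],
})

# One fitted encoder per column, so each mapping stays available
encoders = {}
for column in ['make', 'transmission']:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# Classes are sorted alphabetically, so 'bmw' -> 0 and 'toyota' -> 1
original_makes = encoders['make'].inverse_transform(toy['make'])
print(toy['make'].tolist(), list(original_makes))
```

Note that the integer codes carry an arbitrary ordering. Tree-based models such as random forests usually tolerate this, but distance-based or linear models may not.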

### Transform Date-Time Attributes to Numerical Values

In [9]:
df['reg_date'] = pd.to_datetime(df['reg_date'], format='%d-%b-%Y')
df['reg_year'] = df['reg_date'].dt.year
df = df.drop(columns=['reg_date'])

num_records, num_attributes = df.shape

print('There are {} data points, each with {} attributes.'.format(num_records, num_attributes))

There are 24258 data points, each with 18 attributes.
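
The format string `'%d-%b-%Y'` expects a day, an abbreviated English month name, and a four-digit year. The attribute count stays at 18 because reg_year replaces reg_date. A small sketch on hypothetical date strings (not taken from the dataset):

```python
import pandas as pd

# Hypothetical registration dates in the day-month-year format used above
reg_dates = pd.Series(['09-Dec-2013', '28-Jan-2020'])

# Parse the strings into datetimes, then keep only the year as an integer
reg_years = pd.to_datetime(reg_dates, format='%d-%b-%Y').dt.year
print(reg_years.tolist())
```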


### Data Encoding

In [10]:
from util.DataPreprocess import DataEncoding

df = DataEncoding(df)

Number of unique categories: 15
Unique categories: {'opc car', 'coe car', 'direct owner sale', 'consignment car', 'parf car', 'rare & exotic', 'sgcarmart warranty cars', 'sta evaluated car', 'hybrid cars', 'vintage cars', 'imported used vehicle', 'low mileage car', 'premium ad car', 'almost new car', 'electric cars'}
There are 24258 data points, each with 32 attributes.
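
`DataEncoding` is defined in `util/DataPreprocess.py` and is not shown here. The output suggests a multi-hot encoding: one multi-valued column is replaced by 15 indicator columns, which matches the change from 18 to 32 attributes (18 - 1 + 15). The sketch below shows how this could be done with `MultiLabelBinarizer`. It assumes a column named `category` holding comma-separated tags, which is our guess at the input format, and the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical rows; the column name 'category' is an assumption)
toy = pd.DataFrame({
    'category': ['parf car, premium ad car', 'coe car', 'parf car, low mileage car'],
    'price': [50000, 30000, 72000],
})

# Split each cell into a list of trimmed tags
tags = toy['category'].apply(lambda cell: [tag.strip() for tag in cell.split(',')])

# One 0/1 indicator column per unique tag
mlb = MultiLabelBinarizer()
indicators = pd.DataFrame(mlb.fit_transform(tags), columns=mlb.classes_, index=toy.index)

# Replace the original column with its indicator columns
encoded = pd.concat([toy.drop(columns=['category']), indicators], axis=1)
print('Number of unique categories:', len(mlb.classes_))
print(encoded.shape)
```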


### Outlier Removal

In [11]:
from util.DataPreprocess import OutlierRemoval
df = OutlierRemoval(df)

There are 23923 data points, each with 32 attributes
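
This step removes 335 data points (24258 to 23923). The rule used by `OutlierRemoval` is not shown in this notebook. One widely used rule, given here only as an illustrative sketch and not as the helper's actual logic, keeps the values within 1.5 interquartile ranges of the quartiles. The function name and the toy prices are hypothetical.

```python
import pandas as pd


def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]


# Toy data (hypothetical values): the last price is far from the rest
toy = pd.DataFrame({'price': [48000, 50000, 52000, 55000, 60000, 900000]})

cleaned = remove_iqr_outliers(toy, 'price')
print('There are {} data points left.'.format(len(cleaned)))
```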


### Saving the Data

In [12]:
output_file = './data/train_preprocessed.csv'


# Check if the file exists
if os.path.exists(output_file):
    # Delete the file
    os.remove(output_file)
    print(f'Existing file "{output_file}" has been deleted.')

# Save the DataFrame to CSV
df.to_csv(output_file, index=False)
print(f'DataFrame has been saved to "{output_file}".')

Existing file "./data/train_preprocessed.csv" has been deleted.
DataFrame has been saved to "./data/train_preprocessed.csv".


## Data Mining

### Load the Preprocessed Data

In [15]:
# Load file into pandas dataframe, we saved our preprocessed file at path 'output_file'
output_file = './data/train_preprocessed.csv'
df = pd.read_csv(output_file)

num_records, num_attributes = df.shape
print('There are {} data points, each with {} attributes.'.format(num_records, num_attributes))

# Keep only the attributes used for mining, plus the target 'price'
columns_to_keep = [
    'dereg_value',
    'arf',
    'omv',
    'depreciation',
    'power',
    'coe',
    'price',
]

df = df[columns_to_keep]

There are 23923 data points, each with 32 attributes.
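
The notebook does not state how these attributes were chosen. One simple way to shortlist numeric attributes, shown here as a hedged sketch and not as the selection method actually used, is to rank them by absolute Pearson correlation with the target. The toy values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'omv': [18000, 25000, 31000, 45000, 60000],
    'no_of_owners': [2, 1, 3, 1, 2],
    'price': [40000, 58000, 70000, 99000, 135000],
})

# Absolute correlation of each attribute with 'price', strongest first
correlations = (
    toy.corr()['price']
    .drop('price')
    .abs()
    .sort_values(ascending=False)
)
top_feature = correlations.index[0]
print(correlations)
```

Correlation only captures linear relationships, so a low score does not rule out an attribute that a random forest could still exploit.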


### Random Forest Regression

In [16]:
from util.DataMining import split_dataframe, RandomForestMining

run_times, rmse_sum = 10, 0
for i in tqdm(range(run_times), desc='Running Random Forest'):
    target_col = 'price'
    x_train, x_test, y_train, y_test = split_dataframe(df, target_col)
    rmse_sum += RandomForestMining(x_train, x_test, y_train, y_test)
print('Average RMSE:', round(rmse_sum / run_times))

RMSE on test data: 24460.462438243812
RMSE on test data: 25977.98556643058
RMSE on test data: 26827.63660438347
RMSE on test data: 19414.167713938587
RMSE on test data: 15696.621630034937
RMSE on test data: 20990.00535243494
RMSE on test data: 20795.712402475234
RMSE on test data: 18729.787408218075
RMSE on test data: 19881.737914232355
RMSE on test data: 17021.24361348311
Running Random Forest: 100%|██████████| 10/10 [01:16<00:00,  7.69s/it]
Average RMSE: 20980
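
`split_dataframe` and `RandomForestMining` come from `util/DataMining.py` and are not shown in this notebook. The sketch below is our assumption of what such helpers could look like with scikit-learn: a random train/test split followed by a random forest regressor scored by RMSE. The function names ending in `_sketch`, the 80/20 split, the 100 trees, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def split_dataframe_sketch(frame, target_col, test_size=0.2, seed=None):
    """Split a DataFrame into train/test features and targets."""
    features = frame.drop(columns=[target_col])
    target = frame[target_col]
    return train_test_split(features, target, test_size=test_size, random_state=seed)


def random_forest_rmse_sketch(x_train, x_test, y_train, y_test, seed=None):
    """Fit a random forest and return the RMSE on the test split."""
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    return float(np.sqrt(mean_squared_error(y_test, predictions)))


# Synthetic data (not from the dataset): price is roughly twice omv plus noise
rng = np.random.default_rng(0)
omv = rng.uniform(10000, 80000, size=200)
toy = pd.DataFrame({'omv': omv, 'price': 2.0 * omv + rng.normal(0, 1000, size=200)})

x_train, x_test, y_train, y_test = split_dataframe_sketch(toy, 'price', seed=0)
rmse = random_forest_rmse_sketch(x_train, x_test, y_train, y_test, seed=0)
print('RMSE on test data:', rmse)
```

In the runs logged above, the RMSE ranges from about 15,697 to 26,828. Some of that spread likely comes from the different random splits, so fixing the seed or using k-fold cross-validation would make runs easier to compare.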



