# CarList Listings

This project aims to analyse car listings data scraped from the Carlist.my website. Each group member was assigned a different range of pages to scrape, resulting in multiple datasets that will be merged into one unified dataset for further analysis. At this stage, the exact number of rows, columns, and memory usage is unknown, as the datasets have yet to be combined.

The project involves the essential preprocessing steps of merging, cleaning, and transforming the data to ensure consistency and usability. We then apply performance optimization techniques using several libraries. The optimized implementations (Polars, Modin, and Dask, as listed in the table below) are compared against the baseline pandas version to observe their effect on speed and efficiency.

**Group Members:**

| Name | Matric Number | Library Used |
|------|---------------|--------------|
| Marcus Joey Sayner | A22EC0193 | Polars |
| Muhammad Luqman Hakim bin Mohd Rizaudin | A22EC0086 | Modin |
| Goh Jing Yang | A22EC0052 | Dask |
| Camily Tang Jia Lei | A22EC0039 | Pandas |


## 1. Combine Dataset

To prepare the dataset for analysis, we first combined the individual files collected by each group member: Camily, Luqman, Jing Yang, and Marcus. Each member scraped a different range of pages from the Carlist.my website and saved the data as a CSV file. These files were then loaded using the pandas library and concatenated into one comprehensive dataset. This combined dataset is essential for performing consistent data cleaning, transformation, and performance optimization across all entries.
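One thing to keep in mind at this step is that `pd.concat` aligns frames by column name, so a naming difference between the members' files produces separate, partially empty columns rather than an error. The sketch below illustrates this with two tiny made-up frames; the values, the file names, and the `source_file` column are illustrative assumptions, not project data. This is one plausible reason why the combined data later contains both a 'Seat Capacity' and a 'Seating Capacity' column.

```python
import pandas as pd

# Two hypothetical scrape results whose seat columns are named differently.
part_a = pd.DataFrame({"Car Brand": ["Lexus", "Toyota"], "Seat Capacity": [5, 7]})
part_b = pd.DataFrame({"Car Brand": ["Honda"], "Seating Capacity": [5]})

# Tag each frame with its origin so rows stay traceable after merging.
part_a["source_file"] = "a.csv"
part_b["source_file"] = "b.csv"

# concat takes the union of the columns and fills the gaps with NaN.
combined = pd.concat([part_a, part_b], ignore_index=True)

print(combined.columns.tolist())
print(combined.isna().sum().to_dict())
```

Checking the column list and the per-column null counts right after concatenation makes such mismatches visible before any cleaning starts.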

In [2]:
%pip install pandas pymongo

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [3]:
import pandas as pd
from pymongo import MongoClient

# List of CSV files scraped by each group member
files = [
    "camily_carlist_listings.csv",
    "luqman_carlist_listings.csv",
    "jingyang_carlist_listings.csv",
    "marcus_carlist_listings.csv"
]

# Read and combine datasets
dfs = [pd.read_csv(file) for file in files]
combined_df = pd.concat(dfs, ignore_index=True)

# Save combined dataset
combined_df.to_csv("carlist_combined.csv", index=False)

# Output shape and preview
print("Combined dataset shape:", combined_df.shape)
print(combined_df.head(5))

# Connect to MongoDB 
client = MongoClient("mongodb://localhost:27017")
db = client["carlist_db"]
collection = db["listings"]

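# Insert combined data into MongoDB
# Note: re-running this cell appends the same records again, so clear the
# collection first if the notebook is executed more than once.
collection.insert_many(combined_df.to_dict(orient="records"))
print("✅ Data inserted into MongoDB successfully.")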


Combined dataset shape: (174150, 17)
                                            Car Name Car Brand      Car Model  \
0                   2023 Lexus RX350 2.4 F Sport SUV     Lexus          RX350   
1  2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...    Toyota         Estima   
2  2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...   Porsche        Cayenne   
3             2021 Honda City 1.5 V i-VTEC Hatchback     Honda           City   
4  2022 Toyota Corolla Cross 1.8 V SUV Full Servi...    Toyota  Corolla Cross   

   Manufacture Year  Body Type                Fuel Type            Mileage  \
0              2023        SUV  Petrol - Unleaded (ULP)     5 - 10K KM       
1              2010        MPV  Petrol - Unleaded (ULP)  115 - 120K KM       
2              2020      Coupe  Petrol - Unleaded (ULP)    20 - 25K KM       
3              2021  Hatchback  Petrol - Unleaded (ULP)    90 - 95K KM       
4              2022        SUV  Petrol - Unleaded (ULP)    80 - 85K KM       

  Trans

## 2. Data Preparation and Cleaning

In this phase of the project, we focus on preparing the combined dataset for analysis by cleaning and transforming the data. This step involves addressing common issues such as missing values, standardizing column names, removing irrelevant columns, and correcting any inconsistencies in the data. We will also handle data types to ensure that all columns are in the appropriate format for analysis. Proper data cleaning is essential to ensure the accuracy and consistency of the dataset before applying any optimization techniques or further analysis.
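As an illustration of the kind of transformation this phase involves, the sketch below parses two of the string columns visible in the preview, `Installment` (for example `RM 4,862/month`) and `Mileage` (for example `5 - 10K KM` with trailing whitespace). The helper names are hypothetical, and reading `5 - 10K KM` as 5,000 to 10,000 km is an assumption; the notebook's own cleaning steps may differ.

```python
import re


def parse_installment(text):
    """Return the monthly amount in RM as an int, or None if it cannot be parsed."""
    if not isinstance(text, str):
        return None  # covers NaN / None placeholders
    match = re.search(r"RM\s*([\d,]+)", text)
    return int(match.group(1).replace(",", "")) if match else None


def parse_mileage_range(text):
    """Return (low_km, high_km) for strings like '5 - 10K KM', else None."""
    if not isinstance(text, str):
        return None
    match = re.search(r"(\d+)\s*-\s*(\d+)\s*K\s*KM", text.strip(), re.IGNORECASE)
    if not match:
        return None
    low, high = (int(group) * 1000 for group in match.groups())
    return low, high


print(parse_installment("RM 4,862/month"))
print(parse_mileage_range("115 - 120K KM     "))
```

Returning `None` for unparseable input keeps bad rows visible as missing values instead of raising in the middle of a column-wide transformation.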

### Install and Import Libraries

**1. Pandas**

In [4]:
import pandas as pd
import numpy as np
import re
import time
import psutil
import os
import tracemalloc
from pymongo import MongoClient

**2. Polars**

In [5]:
%pip install polars

In [6]:
import polars as pl
import numpy as np
import re
import time
import psutil
import os
import tracemalloc
from pymongo import MongoClient

**3. Modin**

In [7]:
%pip install "modin[dask]" dask distributed

In [8]:
import os
os.environ["MODIN_ENGINE"] = "dask"

import modin.pandas as md
import numpy as np
import psutil
import re
import tracemalloc
import time

**4. Dask**

In [9]:
%pip install dask pymongo pyarrow

In [10]:
import dask.dataframe as dd
import numpy as np
import dask.array as da
import dask.bag as db  # note: rebinds `db`, which earlier held the MongoDB database handle; `collection` stays valid
import dask
import pandas as pd
import time
import psutil
import re
import os
import tracemalloc

### Import Data
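Every cell in this section repeats the same measurement scaffolding around the actual operation. A minimal sketch of how that scaffolding could be factored into a reusable context manager is shown here. It is not the notebook's method: the name `measure` and the result keys are hypothetical, and it swaps in the standard-library `time.process_time()` (CPU time of the current process only) in place of the system-wide `psutil.cpu_times()` used in the cells. Process-level CPU time would not include work done in separate worker processes, and `tracemalloc` only sees allocations made through Python's allocator.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label):
    """Collect wall time, process CPU time and Python-heap memory for a block."""
    stats = {"label": label}
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU seconds used by this process only
    try:
        yield stats  # the caller may set stats["rows"] inside the block
    finally:
        stats["wall_s"] = time.perf_counter() - wall_start
        stats["cpu_ms"] = (time.process_time() - cpu_start) * 1000
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        stats["current_mb"] = current / 1e6
        stats["peak_mb"] = peak / 1e6
        rows = stats.get("rows")
        if rows is not None and stats["wall_s"] > 0:
            stats["rows_per_s"] = rows / stats["wall_s"]


# Usage with a stand-in workload instead of a DataFrame operation.
with measure("demo") as m:
    squares = [i * i for i in range(100_000)]
    m["rows"] = len(squares)

print(m)
```

With a helper like this, each library's cell would only contain the operation being compared, which makes it easier to see that all four are timed in the same way.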

**1. Pandas**

In [11]:
# Setup
process_import_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU counters
start_time_import_pandas = time.perf_counter()
start_cpu_time_import_pandas = psutil.cpu_times().user  # system-wide user CPU time in seconds (all processes)
start_cpu_percent_import_pandas = psutil.cpu_percent(interval=None)  # primes psutil's CPU-percent counter

# ===============================================

# === Read from MongoDB into pandas DataFrame ===
data_import_pandas = list(collection.find())
df_pandas = pd.DataFrame(data_import_pandas)
df_pandas.drop(columns=['_id'], inplace=True)
# ===============================================

# Record end time, memory, and CPU usage
end_time_import_pandas = time.perf_counter()
end_cpu_time_import_pandas = psutil.cpu_times().user  # End CPU time
current_mem_import_pandas, peak_mem_import_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled over the 1 second after the step has finished
cpu_usage_import_pandas = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_import_pandas = end_time_import_pandas - start_time_import_pandas
cpu_time_import_pandas = (end_cpu_time_import_pandas - start_cpu_time_import_pandas) * 1000  # CPU time in ms
total_rows_import_pandas = len(df_pandas)
throughput_import_pandas = total_rows_import_pandas / elapsed_time_import_pandas if elapsed_time_import_pandas > 0 else 0

# Output preview
print("✅ Data loaded from MongoDB")
print(df_pandas.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows loaded     : {total_rows_import_pandas}")
print(f"Wall Time (Elapsed)   : {elapsed_time_import_pandas:.4f} seconds")
print(f"CPU Time              : {cpu_time_import_pandas:.0f} ms")
print(f"CPU Usage             : {cpu_usage_import_pandas:.2f}%")
print(f"Throughput            : {throughput_import_pandas:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_import_pandas / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_import_pandas / 1e6:.2f} MB")
print("=================================================")


✅ Data loaded from MongoDB
                                            Car Name Car Brand Car Model  \
0                   2023 Lexus RX350 2.4 F Sport SUV     Lexus     RX350   
1  2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...    Toyota    Estima   
2  2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...   Porsche   Cayenne   

   Manufacture Year Body Type                Fuel Type            Mileage  \
0              2023       SUV  Petrol - Unleaded (ULP)     5 - 10K KM       
1              2010       MPV  Petrol - Unleaded (ULP)  115 - 120K KM       
2              2020     Coupe  Petrol - Unleaded (ULP)    20 - 25K KM       

  Transmission  Color   Price     Installment  \
0    Automatic  Black  375000  RM 4,862/month   
1    Automatic  White   55999    RM 726/month   
2    Automatic   Grey  662222  RM 8,585/month   

                                Condition Seating Capacity           Location  \
0  http://schema.org/RefurbishedCondition                5   Selangor, Klang    


**2. Polars**

In [12]:
# Setup
process_import_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_import_polars = time.perf_counter()
start_cpu_time_import_polars = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_import_polars = psutil.cpu_percent(interval=None)

# ===============================================

# === Read from MongoDB into Polars DataFrame ===
data_import_polars = list(collection.find())
df_polars = pl.DataFrame(data_import_polars, infer_schema_length=100000)  
df_polars = df_polars.drop("_id")

# ===============================================

# Record end time, memory, and CPU usage
end_time_import_polars = time.perf_counter()
end_cpu_time_import_polars = psutil.cpu_times().user  # End CPU time
current_mem_import_polars, peak_mem_import_polars = tracemalloc.get_traced_memory() 
tracemalloc.stop()

# System-wide CPU usage, sampled over the 1 second after the step has finished
cpu_usage_import_polars = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_import_polars = end_time_import_polars - start_time_import_polars
cpu_time_import_polars = (end_cpu_time_import_polars - start_cpu_time_import_polars) * 1000  # CPU time in ms
total_rows_import_polars = df_polars.shape[0]
throughput_import_polars = total_rows_import_polars / elapsed_time_import_polars if elapsed_time_import_polars > 0 else 0

# Output preview
print("✅ Data loaded from MongoDB")
print(df_polars.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows loaded     : {total_rows_import_polars}")
print(f"Wall Time (Elapsed)   : {elapsed_time_import_polars:.4f} seconds") # Time
print(f"CPU Time              : {cpu_time_import_polars:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_import_polars:.2f}%") # CPU Usage
print(f"Throughput            : {throughput_import_polars:.2f} rows/sec") # Throughput
print(f"Current Memory (Python): {current_mem_import_polars / 1e6:.2f} MB") # Memory
print(f"Peak Memory (Python)   : {peak_mem_import_polars / 1e6:.2f} MB")
print("=================================================")


✅ Data loaded from MongoDB
shape: (3, 17)
┌────────────┬───────────┬───────────┬────────────┬───┬────────────┬────────────┬────────────┬─────┐
│ Car Name   ┆ Car Brand ┆ Car Model ┆ Manufactur ┆ … ┆ Location   ┆ Sales      ┆ Seat       ┆ URL │
│ ---        ┆ ---       ┆ ---       ┆ e Year     ┆   ┆ ---        ┆ Channel    ┆ Capacity   ┆ --- │
│ str        ┆ str       ┆ str       ┆ ---        ┆   ┆ str        ┆ ---        ┆ ---        ┆ str │
│            ┆           ┆           ┆ i64        ┆   ┆            ┆ str        ┆ str        ┆     │
╞════════════╪═══════════╪═══════════╪════════════╪═══╪════════════╪════════════╪════════════╪═════╡
│ 2023 Lexus ┆ Lexus     ┆ RX350     ┆ 2023       ┆ … ┆ Selangor,  ┆ Sales      ┆ NaN        ┆ NaN │
│ RX350 2.4  ┆           ┆           ┆            ┆   ┆ Klang      ┆ Agent      ┆            ┆     │
│ F Sport S… ┆           ┆           ┆            ┆   ┆            ┆ …          ┆            ┆     │
│ 2010       ┆ Toyota    ┆ Estima    ┆ 2010      

**3. Modin**

In [13]:
# Setup
process_import_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_import_modin = time.perf_counter()
start_cpu_time_import_modin = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_import_modin = psutil.cpu_percent(interval=None)

# ===============================================

# === Read from MongoDB into Modin DataFrame ===
data_import_modin = list(collection.find())
df_modin = md.DataFrame(data_import_modin)
df_modin.drop(columns=['_id'], inplace=True)
# ===============================================

# Record end time, memory, and CPU usage
end_time_import_modin = time.perf_counter()
end_cpu_time_import_modin = psutil.cpu_times().user  # End CPU time
current_mem_import_modin, peak_mem_import_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled over the 1 second after the step has finished
cpu_usage_import_modin = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_import_modin = end_time_import_modin - start_time_import_modin
cpu_time_import_modin = (end_cpu_time_import_modin - start_cpu_time_import_modin) * 1000  # CPU time in ms
total_rows_import_modin = len(df_modin)
throughput_import_modin = total_rows_import_modin / elapsed_time_import_modin if elapsed_time_import_modin > 0 else 0

# Output preview
print("✅ Data loaded from MongoDB")
print(df_modin.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows loaded     : {total_rows_import_modin}")
print(f"Wall Time (Elapsed)   : {elapsed_time_import_modin:.4f} seconds")
print(f"CPU Time              : {cpu_time_import_modin:.0f} ms")
print(f"CPU Usage             : {cpu_usage_import_modin:.2f}%")
print(f"Throughput            : {throughput_import_modin:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_import_modin / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_import_modin / 1e6:.2f} MB")
print("=================================================")




✅ Data loaded from MongoDB
                                            Car Name Car Brand Car Model  \
0                   2023 Lexus RX350 2.4 F Sport SUV     Lexus     RX350   
1  2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...    Toyota    Estima   
2  2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...   Porsche   Cayenne   

   Manufacture Year Body Type                Fuel Type            Mileage  \
0              2023       SUV  Petrol - Unleaded (ULP)     5 - 10K KM       
1              2010       MPV  Petrol - Unleaded (ULP)  115 - 120K KM       
2              2020     Coupe  Petrol - Unleaded (ULP)    20 - 25K KM       

  Transmission  Color   Price     Installment  \
0    Automatic  Black  375000  RM 4,862/month   
1    Automatic  White   55999    RM 726/month   
2    Automatic   Grey  662222  RM 8,585/month   

                                Condition Seating Capacity           Location  \
0  http://schema.org/RefurbishedCondition                5   Selangor, Klang    




**4. Dask**

In [14]:
# Setup
process_import_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_import_dask = time.perf_counter()
start_cpu_time_import_dask = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_import_dask = psutil.cpu_percent(interval=None)

# ===============================================
# === Read from MongoDB into Dask DataFrame ===
data = list(collection.find())  # Fetch all documents as a list of dicts
df_dask = dd.from_pandas(pd.DataFrame(data), npartitions=4)
df_dask = df_dask.drop(columns=['_id'])  # Drop the '_id' column 
# ===============================================

# Record end time, memory, and CPU usage
end_time_import_dask = time.perf_counter()
end_cpu_time_import_dask = psutil.cpu_times().user  # End CPU time
current_mem_import_dask, peak_mem_import_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled over the 1 second after the step has finished
cpu_usage_import_dask = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_import_dask = end_time_import_dask - start_time_import_dask
cpu_time_import_dask = (end_cpu_time_import_dask - start_cpu_time_import_dask) * 1000  # CPU time in ms
total_rows_import_dask = df_dask.shape[0].compute()  # Get total rows after compute
throughput_import_dask = total_rows_import_dask / elapsed_time_import_dask if elapsed_time_import_dask > 0 else 0

# Output preview 
print("✅ Data loaded from MongoDB")
print(df_dask.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows loaded     : {total_rows_import_dask}")
print(f"Wall Time (Elapsed)   : {elapsed_time_import_dask:.4f} seconds")
print(f"CPU Time              : {cpu_time_import_dask:.0f} ms")
print(f"CPU Usage             : {cpu_usage_import_dask:.2f}%")
print(f"Throughput            : {throughput_import_dask:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_import_dask / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_import_dask / 1e6:.2f} MB")
print("=================================================")


✅ Data loaded from MongoDB


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


                                            Car Name Car Brand Car Model  \
0                   2023 Lexus RX350 2.4 F Sport SUV     Lexus     RX350   
1  2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...    Toyota    Estima   
2  2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...   Porsche   Cayenne   

   Manufacture Year Body Type                Fuel Type            Mileage  \
0              2023       SUV  Petrol - Unleaded (ULP)     5 - 10K KM       
1              2010       MPV  Petrol - Unleaded (ULP)  115 - 120K KM       
2              2020     Coupe  Petrol - Unleaded (ULP)    20 - 25K KM       

  Transmission  Color   Price     Installment  \
0    Automatic  Black  375000  RM 4,862/month   
1    Automatic  White   55999    RM 726/month   
2    Automatic   Grey  662222  RM 8,585/month   

                                Condition Seating Capacity           Location  \
0  http://schema.org/RefurbishedCondition                5   Selangor, Klang    
1         http://schema.org

### Combine Columns

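To standardize the data, we combined the 'Seat Capacity' and 'Seating Capacity' columns into a single 'Seat Capacity' column: the value from 'Seat Capacity' is kept where it is present, and the value from 'Seating Capacity' fills in where it is missing. The redundant 'Seating Capacity' column is then dropped.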
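The fallback behaviour can be checked on a toy frame before running it on the full dataset. The values below are made up for illustration; only the two column names come from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Seat Capacity": [5, np.nan, np.nan, 4],
    "Seating Capacity": [np.nan, 7, np.nan, 5],
})

# Keep 'Seat Capacity' where present; otherwise fall back to 'Seating Capacity'.
merged = toy["Seat Capacity"].combine_first(toy["Seating Capacity"])

# For two Series that share the same index, fillna with a Series has the same effect.
same = toy["Seat Capacity"].fillna(toy["Seating Capacity"])

print(merged.tolist())
print(merged.equals(same))
```

Rows where both columns are missing stay missing, and rows where both are filled keep the 'Seat Capacity' value. This only applies to real missing values: a literal string such as `"NaN"` has to be converted to null first, which is what the Polars cell in this section does before coalescing.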

**1. Pandas**

In [15]:
# Setup
process_combinecol_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_combinecol_pandas = time.perf_counter()
start_cpu_time_combinecol_pandas = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_combinecol_pandas = psutil.cpu_percent(interval=None)

# === Data Cleaning Step: Combine 'Seat Capacity' and 'Seating Capacity' ===
total_rows_combinecol_pandas = df_pandas.shape[0]

# Combine 'Seat Capacity' and 'Seating Capacity' into one column
df_pandas['Seat Capacity'] = df_pandas['Seat Capacity'].combine_first(df_pandas['Seating Capacity'])

# Drop 'Seating Capacity' column
df_pandas.drop(columns=['Seating Capacity'], inplace=True)

# Record end time, memory, and CPU usage
end_time_combinecol_pandas = time.perf_counter()
end_cpu_time_combinecol_pandas = psutil.cpu_times().user  # End CPU time
current_mem_combinecol_pandas, peak_mem_combinecol_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled over the 1 second after the step has finished
cpu_usage_combinecol_pandas = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_combinecol_pandas = end_time_combinecol_pandas - start_time_combinecol_pandas
cpu_time_combinecol_pandas = (end_cpu_time_combinecol_pandas - start_cpu_time_combinecol_pandas) * 1000  # in ms
throughput_combinecol_pandas = total_rows_combinecol_pandas / elapsed_time_combinecol_pandas if elapsed_time_combinecol_pandas > 0 else 0

# Output
print("✅ Data cleaning step completed")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_combinecol_pandas}")
print(f"Wall Time (Elapsed)    : {elapsed_time_combinecol_pandas:.4f} seconds")
print(f"CPU Time               : {cpu_time_combinecol_pandas:.0f} ms")
print(f"CPU Usage              : {cpu_usage_combinecol_pandas:.2f}%")
print(f"Throughput             : {throughput_combinecol_pandas:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_combinecol_pandas / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_combinecol_pandas / 1e6:.2f} MB")
print("=================================================")


✅ Data cleaning step completed

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.0824 seconds
CPU Time               : 62 ms
CPU Usage              : 8.70%
Throughput             : 2112196.89 rows/sec
Current Memory (Python): 22.34 MB
Peak Memory (Python)   : 23.71 MB


In [16]:
# Display dataframe
df_pandas.head()

Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity,URL
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10K KM,Automatic,Black,375000,"RM 4,862/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,5,
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120K KM,Automatic,White,55999,RM 726/month,http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,7,
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25K KM,Automatic,Grey,662222,"RM 8,585/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,4,
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95K KM,Automatic,Silver,67000,RM 869/month,http://schema.org/UsedCondition,"Johor, Ulu Tiram",Sales Agent ...,5,
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85K KM,Automatic,White,98999,"RM 1,283/month",http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,5,


**2. Polars**

In [17]:

# Setup
process_combinecol_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_combinecol_polars = time.perf_counter()
start_cpu_time_combinecol_polars = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_combinecol_polars = psutil.cpu_percent(interval=None)

# ================= Data Cleaning in Polars =================
total_rows_combinecol_polars = df_polars.shape[0]

# Replace string "NaN" with nulls
df_polars = df_polars.with_columns([
    pl.when(pl.col("Seat Capacity") == "NaN").then(None).otherwise(pl.col("Seat Capacity")).alias("Seat Capacity"),
    pl.when(pl.col("Seating Capacity") == "NaN").then(None).otherwise(pl.col("Seating Capacity")).alias("Seating Capacity")
])

# Combine the two columns using coalesce
df_polars = df_polars.with_columns([
    pl.coalesce(["Seat Capacity", "Seating Capacity"]).alias("Seat Capacity")
])

# Drop column
df_polars = df_polars.drop("Seating Capacity")

# ==========================================================

# Record end time, memory, and CPU usage
end_time_combinecol_polars = time.perf_counter()
end_cpu_time_combinecol_polars = psutil.cpu_times().user  # End CPU time
current_mem_combinecol_polars, peak_mem_combinecol_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled over the 1 second after the step has finished
cpu_usage_combinecol_polars = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_combinecol_polars = end_time_combinecol_polars - start_time_combinecol_polars
cpu_time_combinecol_polars = (end_cpu_time_combinecol_polars - start_cpu_time_combinecol_polars) * 1000  # in ms
throughput_combinecol_polars = total_rows_combinecol_polars / elapsed_time_combinecol_polars if elapsed_time_combinecol_polars > 0 else 0

# Output
print("✅ Data cleaning step completed")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_combinecol_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_combinecol_polars:.4f} seconds")
print(f"CPU Time               : {cpu_time_combinecol_polars:.0f} ms")
print(f"CPU Usage              : {cpu_usage_combinecol_polars:.2f}%")
print(f"Throughput             : {throughput_combinecol_polars:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_combinecol_polars / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_combinecol_polars / 1e6:.2f} MB")
print("=================================================")

✅ Data cleaning step completed

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.0393 seconds
CPU Time               : 94 ms
CPU Usage              : 15.40%
Throughput             : 4429832.37 rows/sec
Current Memory (Python): 0.04 MB
Peak Memory (Python)   : 0.06 MB


In [18]:
df_polars.head()

Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity,URL
str,str,str,i64,str,str,str,str,str,i64,str,str,str,str,str,str
"""2023 Lexus RX350 2.4 F Sport S…","""Lexus""","""RX350""",2023,"""SUV""","""Petrol - Unleaded (ULP)""","""5 - 10K KM ""","""Automatic""","""Black""",375000,"""RM 4,862/month""","""http://schema.org/RefurbishedC…","""Selangor, Klang ""","""Sales Agent …","""5""","""NaN"""
"""2010 Toyota Estima 2.4 Aeras M…","""Toyota""","""Estima""",2010,"""MPV""","""Petrol - Unleaded (ULP)""","""115 - 120K KM ""","""Automatic""","""White""",55999,"""RM 726/month""","""http://schema.org/UsedConditio…","""Selangor, Kajang ""","""Sales Agent …","""7""","""NaN"""
"""2020 Porsche Cayenne Coupe 4.0…","""Porsche""","""Cayenne""",2020,"""Coupe""","""Petrol - Unleaded (ULP)""","""20 - 25K KM ""","""Automatic""","""Grey""",662222,"""RM 8,585/month""","""http://schema.org/RefurbishedC…","""Selangor, Klang ""","""Sales Agent …","""4""","""NaN"""
"""2021 Honda City 1.5 V i-VTEC H…","""Honda""","""City""",2021,"""Hatchback""","""Petrol - Unleaded (ULP)""","""90 - 95K KM ""","""Automatic""","""Silver""",67000,"""RM 869/month""","""http://schema.org/UsedConditio…","""Johor, Ulu Tiram ""","""Sales Agent …","""5""","""NaN"""
"""2022 Toyota Corolla Cross 1.8 …","""Toyota""","""Corolla Cross""",2022,"""SUV""","""Petrol - Unleaded (ULP)""","""80 - 85K KM ""","""Automatic""","""White""",98999,"""RM 1,283/month""","""http://schema.org/UsedConditio…","""Selangor, Kajang ""","""Sales Agent …","""5""","""NaN"""


**3. Modin**

In [19]:
# Setup
process_combinecol_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_combinecol_modin = time.perf_counter()
start_cpu_time_combinecol_modin = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_combinecol_modin = psutil.cpu_percent(interval=None)

# === Data Cleaning Step: Combine 'Seat Capacity' and 'Seating Capacity' ===
total_rows_combinecol_modin = df_modin.shape[0]  

# ========================================================================

# Combine 'Seat Capacity' and 'Seating Capacity' into one column
df_modin['Seat Capacity'] = df_modin['Seat Capacity'].fillna(df_modin['Seating Capacity'])

# Drop 'Seating Capacity' column
df_modin.drop(columns=['Seating Capacity'], inplace=True)

# ========================================================================

# Record end time, memory, and CPU usage
end_time_combinecol_modin = time.perf_counter()
end_cpu_time_combinecol_modin = psutil.cpu_times().user  # End CPU time
current_mem_combinecol_modin, peak_mem_combinecol_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled over the 1 second after the step has finished
cpu_usage_combinecol_modin = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_combinecol_modin = end_time_combinecol_modin - start_time_combinecol_modin
cpu_time_combinecol_modin = (end_cpu_time_combinecol_modin - start_cpu_time_combinecol_modin) * 1000  # CPU time in ms
throughput_combinecol_modin = total_rows_combinecol_modin / elapsed_time_combinecol_modin if elapsed_time_combinecol_modin > 0 else 0 

# Output
print("✅ Data cleaning step completed")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_combinecol_modin}")
print(f"Wall Time (Elapsed)    : {elapsed_time_combinecol_modin:.4f} seconds")  # Time
print(f"CPU Time               : {cpu_time_combinecol_modin:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_combinecol_modin:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_combinecol_modin:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_combinecol_modin / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_combinecol_modin / 1e6:.2f} MB")
print("=================================================")


✅ Data cleaning step completed

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.4683 seconds
CPU Time               : 641 ms
CPU Usage             : 28.00%
Throughput            : 371837.94 rows/sec
Current Memory (Python): 0.83 MB
Peak Memory (Python)   : 1.24 MB


In [20]:
df_modin.head()

Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity,URL
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10K KM,Automatic,Black,375000,"RM 4,862/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,5,
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120K KM,Automatic,White,55999,RM 726/month,http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,7,
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25K KM,Automatic,Grey,662222,"RM 8,585/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,4,
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95K KM,Automatic,Silver,67000,RM 869/month,http://schema.org/UsedCondition,"Johor, Ulu Tiram",Sales Agent ...,5,
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85K KM,Automatic,White,98999,"RM 1,283/month",http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,5,
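To relate these numbers to the pandas baseline, the wall times reported so far for the column-combining step (pandas 0.0824 s, Polars 0.0393 s, Modin 0.4683 s) can be turned into speedup factors. The snippet below is only a sketch of that arithmetic; the figures are single-run measurements copied from the summaries above, so the ratios should be read as indicative rather than as a rigorous benchmark.

```python
# Wall times (seconds) copied from the performance summaries above.
wall_times = {"pandas": 0.0824, "polars": 0.0393, "modin": 0.4683}

baseline = wall_times["pandas"]

# Speedup > 1 means faster than pandas, < 1 means slower.
speedup = {name: baseline / seconds for name, seconds in wall_times.items()}

for name, factor in speedup.items():
    print(f"{name:<7} {factor:.2f}x")
```

On these figures Polars is roughly 2.1 times faster than pandas for this step, while Modin is about 5.7 times slower, which is consistent with scheduling overhead dominating on an operation that pandas completes in well under a tenth of a second.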


**4. Dask**

In [21]:
# Setup
process_combinecol_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_combinecol_dask = time.perf_counter()
start_cpu_time_combinecol_dask = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_combinecol_dask = psutil.cpu_percent(interval=None)

# === Data Cleaning Step: Combine 'Seat Capacity' and 'Seating Capacity' ===
# Counting rows calls .compute(), so Dask evaluates the dataframe inside the timed region
total_rows_combinecol_dask = df_dask.shape[0].compute()  # Get total rows

# ========================================================================

# Combine 'Seat Capacity' and 'Seating Capacity' into one column
# (lazy: this only extends the task graph; the work runs on a later .compute() or .head())
df_dask['Seat Capacity'] = df_dask['Seat Capacity'].combine_first(df_dask['Seating Capacity'])

# Drop 'Seating Capacity' column (also lazy)
df_dask = df_dask.drop(columns=['Seating Capacity'])

# ========================================================================

# Record end time, memory, and CPU usage
end_time_combinecol_dask = time.perf_counter()
end_cpu_time_combinecol_dask = psutil.cpu_times().user  # End CPU time
current_mem_combinecol_dask, peak_mem_combinecol_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_combinecol_dask = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_combinecol_dask = end_time_combinecol_dask - start_time_combinecol_dask
cpu_time_combinecol_dask = (end_cpu_time_combinecol_dask - start_cpu_time_combinecol_dask) * 1000  # CPU time in ms
throughput_combinecol_dask = total_rows_combinecol_dask / elapsed_time_combinecol_dask if elapsed_time_combinecol_dask > 0 else 0

# Output preview
print("✅ Data cleaning step completed")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed       : {total_rows_combinecol_dask}")
print(f"Wall Time (Elapsed)       : {elapsed_time_combinecol_dask:.4f} seconds")  # Time
print(f"CPU Time                  : {cpu_time_combinecol_dask:.0f} ms")  # CPU time
print(f"CPU Usage                 : {cpu_usage_combinecol_dask:.2f}%")  # CPU Usage
print(f"Throughput                : {throughput_combinecol_dask:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python)   : {current_mem_combinecol_dask / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)      : {peak_mem_combinecol_dask / 1e6:.2f} MB")
print("=================================================")

✅ Data cleaning step completed

Total rows processed       : 174150
Wall Time (Elapsed)       : 0.1055 seconds
CPU Time                  : 78 ms
CPU Usage                 : 7.20%
Throughput                : 1650515.34 rows/sec
Current Memory (Python)   : 0.20 MB
Peak Memory (Python)      : 0.65 MB
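
Every benchmark cell in this notebook repeats the same start/stop bookkeeping. As a hedged sketch (not the notebook's actual code), that pattern could be wrapped in a small context manager. The name `measure` and the result keys are hypothetical. The sketch also uses the standard library's `time.process_time()` for per-process CPU time instead of `psutil.cpu_times().user`, which reports system-wide CPU time, so its numbers would not be directly comparable with the outputs shown here.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(total_rows):
    """Collect wall time, CPU time, throughput and Python-level memory for a step.

    Hypothetical helper: yields a dict that is filled in when the block exits.
    """
    result = {}
    tracemalloc.start()
    start_wall = time.perf_counter()
    start_cpu = time.process_time()  # CPU time of this process only
    try:
        yield result
    finally:
        elapsed = time.perf_counter() - start_wall
        cpu = time.process_time() - start_cpu
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        result["wall_s"] = elapsed
        result["cpu_ms"] = cpu * 1000
        result["throughput"] = total_rows / elapsed if elapsed > 0 else 0
        result["current_mb"] = current / 1e6
        result["peak_mb"] = peak / 1e6


# Usage with a stand-in workload (a list of dicts instead of a dataframe)
rows = [{"Seat Capacity": None, "Seating Capacity": "5"} for _ in range(1000)]
with measure(total_rows=len(rows)) as stats:
    for row in rows:
        seating = row.pop("Seating Capacity")
        if row["Seat Capacity"] is None:
            row["Seat Capacity"] = seating

print(f"Wall Time (Elapsed): {stats['wall_s']:.4f} seconds")
print(f"Peak Memory (Python): {stats['peak_mb']:.2f} MB")
```

Keeping the bookkeeping in one place would also make it harder for the individual cells to drift apart, for example in which timer they use.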


In [22]:
df_dask.head()

This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity,URL
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10K KM,Automatic,Black,375000,"RM 4,862/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,5,
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120K KM,Automatic,White,55999,RM 726/month,http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,7,
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25K KM,Automatic,Grey,662222,"RM 8,585/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,4,
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95K KM,Automatic,Silver,67000,RM 869/month,http://schema.org/UsedCondition,"Johor, Ulu Tiram",Sales Agent ...,5,
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85K KM,Automatic,White,98999,"RM 1,283/month",http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,5,


### Remove Unwanted Column(s)

The 'URL' column is not used in the analysis, so we drop it from each dataframe.
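
One practical caveat: in pandas, dropping a column that is already gone raises a `KeyError`, so a cell that drops 'URL' fails if it is run a second time. The sketch below uses a made-up two-row frame (the values are illustrative only) and the `errors="ignore"` option of `DataFrame.drop` to make the step safe to repeat.

```python
import pandas as pd

# Toy frame standing in for the listings data (values are illustrative only)
df = pd.DataFrame({
    "Car Brand": ["Lexus", "Toyota"],
    "Price": [375000, 55999],
    "URL": [None, None],
})

# errors="ignore" makes the drop idempotent: no KeyError if 'URL' is already gone
df = df.drop(columns=["URL"], errors="ignore")
df = df.drop(columns=["URL"], errors="ignore")  # second call is a no-op

print(list(df.columns))
```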

**1. Pandas**

In [23]:
# Setup
process_removecol_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_removecol_pandas = time.perf_counter()
start_cpu_time_removecol_pandas = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_removecol_pandas = psutil.cpu_percent(interval=None)

# === Data Cleaning Step: Remove 'URL' column ===
total_rows_removecol_pandas = df_pandas.shape[0]

# ==============================================

# Remove the 'URL' column
df_pandas.drop(columns=['URL'], inplace=True)

# ==============================================

# Record end time, memory, and CPU usage
end_time_removecol_pandas = time.perf_counter()
end_cpu_time_removecol_pandas = psutil.cpu_times().user  # End CPU time
current_mem_removecol_pandas, peak_mem_removecol_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_removecol_pandas = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_removecol_pandas = end_time_removecol_pandas - start_time_removecol_pandas
cpu_time_removecol_pandas = (end_cpu_time_removecol_pandas - start_cpu_time_removecol_pandas) * 1000  # CPU time in ms
throughput_removecol_pandas = total_rows_removecol_pandas / elapsed_time_removecol_pandas if elapsed_time_removecol_pandas > 0 else 0

# Output preview
print("✅ 'URL' column removed successfully")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed       : {total_rows_removecol_pandas}")
print(f"Wall Time (Elapsed)       : {elapsed_time_removecol_pandas:.4f} seconds")  # Time
print(f"CPU Time                  : {cpu_time_removecol_pandas:.0f} ms")  # CPU time
print(f"CPU Usage                 : {cpu_usage_removecol_pandas:.2f}%")  # CPU Usage
print(f"Throughput                : {throughput_removecol_pandas:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python)   : {current_mem_removecol_pandas / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)      : {peak_mem_removecol_pandas / 1e6:.2f} MB")
print("=================================================")


✅ 'URL' column removed successfully

Total rows processed       : 174150
Wall Time (Elapsed)       : 0.0405 seconds
CPU Time                  : 31 ms
CPU Usage                 : 14.80%
Throughput                : 4298248.86 rows/sec
Current Memory (Python)   : 20.92 MB
Peak Memory (Python)      : 20.94 MB


In [24]:
# Display dataframe
df_pandas.head()

Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10K KM,Automatic,Black,375000,"RM 4,862/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,5
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120K KM,Automatic,White,55999,RM 726/month,http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,7
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25K KM,Automatic,Grey,662222,"RM 8,585/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,4
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95K KM,Automatic,Silver,67000,RM 869/month,http://schema.org/UsedCondition,"Johor, Ulu Tiram",Sales Agent ...,5
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85K KM,Automatic,White,98999,"RM 1,283/month",http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,5


**2. Polars**

In [25]:
# Setup
process_removecol_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_removecol_polars = time.perf_counter()
start_cpu_time_removecol_polars = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_removecol_polars = psutil.cpu_percent(interval=None)

# === Data Cleaning Step: Remove 'URL' column ===
total_rows_removecol_polars = df_polars.height

# ==============================================

# Remove the 'URL' column
df_polars = df_polars.drop("URL")

# ==============================================

# Record end time, memory, and CPU usage
end_time_removecol_polars = time.perf_counter()
end_cpu_time_removecol_polars = psutil.cpu_times().user  # End CPU time
current_mem_removecol_polars, peak_mem_removecol_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_removecol_polars = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_removecol_polars = end_time_removecol_polars - start_time_removecol_polars
cpu_time_removecol_polars = (end_cpu_time_removecol_polars - start_cpu_time_removecol_polars) * 1000  # CPU time in ms
throughput_removecol_polars = total_rows_removecol_polars / elapsed_time_removecol_polars if elapsed_time_removecol_polars > 0 else 0

# Output preview
print("✅ 'URL' column removed successfully")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_removecol_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_removecol_polars:.4f} seconds")  # Time
print(f"CPU Time               : {cpu_time_removecol_polars:.0f} ms")  # CPU time
print(f"CPU Usage              : {cpu_usage_removecol_polars:.2f}%")  # CPU Usage
print(f"Throughput             : {throughput_removecol_polars:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_removecol_polars / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_removecol_polars / 1e6:.2f} MB")
print("=================================================")


✅ 'URL' column removed successfully

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.0011 seconds
CPU Time               : 0 ms
CPU Usage              : 6.80%
Throughput             : 154333569.63 rows/sec
Current Memory (Python): 0.00 MB
Peak Memory (Python)   : 0.02 MB
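
A wall time of about 1 ms with a CPU time that rounds to 0 ms is close to the resolution of a single measurement, so the resulting throughput figure is noisy. One hedged way to stabilise such comparisons is to repeat the operation and report the median. In the sketch below, `time_repeated` and the stand-in workload are hypothetical and not part of the notebook.

```python
import statistics
import time


def time_repeated(func, repeats=7):
    """Run func() several times and return the list of wall times in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload: build a copy of each record without the 'URL' key
records = [{"Price": i, "URL": None} for i in range(10_000)]


def drop_url():
    return [{k: v for k, v in rec.items() if k != "URL"} for rec in records]


timings = time_repeated(drop_url)
median_s = statistics.median(timings)
print(f"median wall time: {median_s:.6f} s over {len(timings)} runs")
print(f"throughput      : {len(records) / median_s:.0f} rows/sec")
```

The workload returns a new list on every call, so each repetition starts from the same input; an in-place drop could not be repeated this way without first restoring the column.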


In [26]:
df_polars.head()

Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity
str,str,str,i64,str,str,str,str,str,i64,str,str,str,str,str
"""2023 Lexus RX350 2.4 F Sport S…","""Lexus""","""RX350""",2023,"""SUV""","""Petrol - Unleaded (ULP)""","""5 - 10K KM ""","""Automatic""","""Black""",375000,"""RM 4,862/month""","""http://schema.org/RefurbishedC…","""Selangor, Klang ""","""Sales Agent …","""5"""
"""2010 Toyota Estima 2.4 Aeras M…","""Toyota""","""Estima""",2010,"""MPV""","""Petrol - Unleaded (ULP)""","""115 - 120K KM ""","""Automatic""","""White""",55999,"""RM 726/month""","""http://schema.org/UsedConditio…","""Selangor, Kajang ""","""Sales Agent …","""7"""
"""2020 Porsche Cayenne Coupe 4.0…","""Porsche""","""Cayenne""",2020,"""Coupe""","""Petrol - Unleaded (ULP)""","""20 - 25K KM ""","""Automatic""","""Grey""",662222,"""RM 8,585/month""","""http://schema.org/RefurbishedC…","""Selangor, Klang ""","""Sales Agent …","""4"""
"""2021 Honda City 1.5 V i-VTEC H…","""Honda""","""City""",2021,"""Hatchback""","""Petrol - Unleaded (ULP)""","""90 - 95K KM ""","""Automatic""","""Silver""",67000,"""RM 869/month""","""http://schema.org/UsedConditio…","""Johor, Ulu Tiram ""","""Sales Agent …","""5"""
"""2022 Toyota Corolla Cross 1.8 …","""Toyota""","""Corolla Cross""",2022,"""SUV""","""Petrol - Unleaded (ULP)""","""80 - 85K KM ""","""Automatic""","""White""",98999,"""RM 1,283/month""","""http://schema.org/UsedConditio…","""Selangor, Kajang ""","""Sales Agent …","""5"""


**3. Modin**

In [27]:
# Setup
process_removecol_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_removecol_modin = time.perf_counter()
start_cpu_time_removecol_modin = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_removecol_modin = psutil.cpu_percent(interval=None)

# === Data Cleaning Step: Remove 'URL' column ===
total_rows_removecol_modin = df_modin.shape[0]

# Remove the 'URL' column
df_modin.drop(columns=['URL'], inplace=True)

# Record end time, memory, and CPU usage
end_time_removecol_modin = time.perf_counter()
end_cpu_time_removecol_modin = psutil.cpu_times().user  # End CPU time
current_mem_removecol_modin, peak_mem_removecol_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_removecol_modin = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_removecol_modin = end_time_removecol_modin - start_time_removecol_modin
cpu_time_removecol_modin = (end_cpu_time_removecol_modin - start_cpu_time_removecol_modin) * 1000  # CPU time in ms
throughput_removecol_modin = total_rows_removecol_modin / elapsed_time_removecol_modin if elapsed_time_removecol_modin > 0 else 0

# Output preview
print("✅ 'URL' column removed successfully")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed       : {total_rows_removecol_modin}")
print(f"Wall Time (Elapsed)       : {elapsed_time_removecol_modin:.4f} seconds")  # Time
print(f"CPU Time                  : {cpu_time_removecol_modin:.0f} ms")  # CPU time
print(f"CPU Usage                 : {cpu_usage_removecol_modin:.2f}%")  # CPU Usage
print(f"Throughput                : {throughput_removecol_modin:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python)   : {current_mem_removecol_modin / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)      : {peak_mem_removecol_modin / 1e6:.2f} MB")
print("=================================================")


✅ 'URL' column removed successfully

Total rows processed       : 174150
Wall Time (Elapsed)       : 0.0048 seconds
CPU Time                  : 0 ms
CPU Usage                 : 7.50%
Throughput                : 36599974.78 rows/sec
Current Memory (Python)   : 0.01 MB
Peak Memory (Python)      : 0.03 MB


In [28]:
df_modin.head()

Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10K KM,Automatic,Black,375000,"RM 4,862/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,5
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120K KM,Automatic,White,55999,RM 726/month,http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,7
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25K KM,Automatic,Grey,662222,"RM 8,585/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,4
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95K KM,Automatic,Silver,67000,RM 869/month,http://schema.org/UsedCondition,"Johor, Ulu Tiram",Sales Agent ...,5
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85K KM,Automatic,White,98999,"RM 1,283/month",http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,5


**4. Dask**

In [29]:
# Setup
process_removecol_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_removecol_dask = time.perf_counter()
start_cpu_time_removecol_dask = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_removecol_dask = psutil.cpu_percent(interval=None)

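# === Data Cleaning Step: Remove 'URL' column ===
# Counting rows calls .compute(), so Dask evaluates the dataframe inside the timed region
total_rows_removecol_dask = df_dask.shape[0].compute()  # Get total rows before operation

# Remove the 'URL' column (lazy: this only extends the task graph)
df_dask = df_dask.drop(columns=['URL'])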

# Record end time, memory, and CPU usage
end_time_removecol_dask = time.perf_counter()
end_cpu_time_removecol_dask = psutil.cpu_times().user  # End CPU time
current_mem_removecol_dask, peak_mem_removecol_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_removecol_dask = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_removecol_dask = end_time_removecol_dask - start_time_removecol_dask
cpu_time_removecol_dask = (end_cpu_time_removecol_dask - start_cpu_time_removecol_dask) * 1000  # CPU time in ms
throughput_removecol_dask = total_rows_removecol_dask / elapsed_time_removecol_dask if elapsed_time_removecol_dask > 0 else 0

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed       : {total_rows_removecol_dask}")
print(f"Wall Time (Elapsed)       : {elapsed_time_removecol_dask:.4f} seconds")  # Time
print(f"CPU Time                  : {cpu_time_removecol_dask:.0f} ms")  # CPU time
print(f"CPU Usage                 : {cpu_usage_removecol_dask:.2f}%")  # CPU Usage
print(f"Throughput                : {throughput_removecol_dask:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python)   : {current_mem_removecol_dask / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)      : {peak_mem_removecol_dask / 1e6:.2f} MB")
print("=================================================")



Total rows processed       : 174150
Wall Time (Elapsed)       : 0.1541 seconds
CPU Time                  : 78 ms
CPU Usage                 : 8.00%
Throughput                : 1130332.57 rows/sec
Current Memory (Python)   : 0.13 MB
Peak Memory (Python)      : 0.65 MB


In [30]:
# Display dataframe
df_dask.head()


Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10K KM,Automatic,Black,375000,"RM 4,862/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,5
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120K KM,Automatic,White,55999,RM 726/month,http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,7
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25K KM,Automatic,Grey,662222,"RM 8,585/month",http://schema.org/RefurbishedCondition,"Selangor, Klang",Sales Agent ...,4
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95K KM,Automatic,Silver,67000,RM 869/month,http://schema.org/UsedCondition,"Johor, Ulu Tiram",Sales Agent ...,5
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85K KM,Automatic,White,98999,"RM 1,283/month",http://schema.org/UsedCondition,"Selangor, Kajang",Sales Agent ...,5


### String Cleaning and Formatting

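In this part, we clean four string-based columns so that their values are consistently formatted: 'Installment' is stripped of 'RM', thousands separators and '/month' and converted to a number; 'Condition' is reduced from a schema.org URL to a plain label such as 'Used'; 'Sales Channel' keeps only 'Sales Agent' or 'Dealer'; and 'Mileage' loses its 'K KM' suffix and is normalised to 5K-wide ranges.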
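
As a rough illustration of how individual 'Installment', 'Condition' and 'Sales Channel' values are cleaned (the notebook applies the same rules column-wise), the sketch below uses only Python's `re` module. The helper names are hypothetical. The first two sample strings come from the dataframe previews in this notebook; the 'Sales Agent' sample carries a made-up suffix because the previews truncate the real value.

```python
import re


def clean_installment(text):
    """'RM 4,862/month' -> 4862; returns None if no plain number is left."""
    digits = text.replace("RM", "").replace(",", "").replace("/month", "").strip()
    return int(digits) if digits.isdigit() else None


def clean_condition(text):
    """'http://schema.org/UsedCondition' -> 'Used'; other strings pass through."""
    match = re.search(r"schema\.org/([A-Za-z]+)Condition", text)
    return match.group(1) if match else text


def clean_sales_channel(text):
    """Keep only a leading 'Sales Agent' or 'Dealer'; None otherwise."""
    match = re.match(r"(Sales Agent|Dealer)", text)
    return match.group(1) if match else None


print(clean_installment("RM 4,862/month"))                       # 4862
print(clean_condition("http://schema.org/RefurbishedCondition"))  # Refurbished
print(clean_sales_channel("Sales Agent (scraped extra text)"))    # Sales Agent
```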

**1. Pandas**

In [31]:
## Part 1: Clean Installment column 
# Setup
process_cleaninstallment_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_cleaninstallment_pandas = time.time()
start_cpu_time_cleaninstallment_pandas = psutil.cpu_times().user
start_cpu_percent_cleaninstallment_pandas = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Installment' Column ===
total_rows_cleaninstallment_pandas = df_pandas.shape[0]

# Cleaning operations
df_pandas['Installment'] = df_pandas['Installment'].str.replace('RM', '', regex=False)
df_pandas['Installment'] = df_pandas['Installment'].str.replace(',', '', regex=False)
df_pandas['Installment'] = df_pandas['Installment'].str.replace('/month', '', regex=False)
df_pandas['Installment'] = pd.to_numeric(df_pandas['Installment'], errors='coerce')

# Record end time, memory, and CPU usage
end_time_cleaninstallment_pandas = time.time()
end_cpu_time_cleaninstallment_pandas = psutil.cpu_times().user
current_mem_cleaninstallment_pandas, peak_mem_cleaninstallment_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_cleaninstallment_pandas = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_cleaninstallment_pandas = end_time_cleaninstallment_pandas - start_time_cleaninstallment_pandas
cpu_time_cleaninstallment_pandas = (end_cpu_time_cleaninstallment_pandas - start_cpu_time_cleaninstallment_pandas) * 1000
throughput_cleaninstallment_pandas = total_rows_cleaninstallment_pandas / elapsed_time_cleaninstallment_pandas if elapsed_time_cleaninstallment_pandas > 0 else 0

# Output preview
print("✅ 'Installment' column cleaned successfully")
print(df_pandas[['Installment']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_cleaninstallment_pandas}")
print(f"Wall Time (Elapsed)    : {elapsed_time_cleaninstallment_pandas:.4f} seconds")
print(f"CPU Time               : {cpu_time_cleaninstallment_pandas:.0f} ms")
print(f"CPU Usage              : {cpu_usage_cleaninstallment_pandas:.2f}%")
print(f"Throughput             : {throughput_cleaninstallment_pandas:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_cleaninstallment_pandas / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_cleaninstallment_pandas / 1e6:.2f} MB")
print("=================================================")


✅ 'Installment' column cleaned successfully
   Installment
0         4862
1          726
2         8585

Total rows processed   : 174150
Wall Time (Elapsed)    : 1.8746 seconds
CPU Time               : 2047 ms
CPU Usage              : 9.80%
Throughput             : 92898.42 rows/sec
Current Memory (Python): 19.13 MB
Peak Memory (Python)   : 34.51 MB


In [32]:
## Part 2: Extract Clean Values from Condition column

# Setup
process_condition_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_condition_pandas = time.time()
start_cpu_time_condition_pandas = psutil.cpu_times().user
start_cpu_percent_condition_pandas = psutil.cpu_percent(interval=None)

# === Data Cleaning: Extract Clean Values from 'Condition' Column ===
total_rows_condition_pandas = df_pandas.shape[0]

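# Extract value from '...schema.org/XCondition'
def extract_condition_pandas(x):
    # Non-string values (e.g. NaN) and strings without a match are returned unchanged
    if isinstance(x, str) and 'schema.org/' in x:
        match = re.search(r'\.org/([A-Za-z]+)Condition', x)
        if match:
            return match.group(1)
    return x

df_pandas['Condition'] = df_pandas['Condition'].apply(extract_condition_pandas)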

# Strip 'Condition' if it remains (for post-processed cases)
df_pandas['Condition'] = df_pandas['Condition'].str.replace(r'Condition$', '', regex=True)

# Record end time, memory, and CPU usage
end_time_condition_pandas = time.time()
end_cpu_time_condition_pandas = psutil.cpu_times().user
current_mem_condition_pandas, peak_mem_condition_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_condition_pandas = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_condition_pandas = end_time_condition_pandas - start_time_condition_pandas
cpu_time_condition_pandas = (end_cpu_time_condition_pandas - start_cpu_time_condition_pandas) * 1000
throughput_condition_pandas = total_rows_condition_pandas / elapsed_time_condition_pandas if elapsed_time_condition_pandas > 0 else 0

# Output preview
print("✅ 'Condition' column cleaned successfully")
print(df_pandas[['Condition']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_condition_pandas}")
print(f"Wall Time (Elapsed)    : {elapsed_time_condition_pandas:.4f} seconds")
print(f"CPU Time               : {cpu_time_condition_pandas:.0f} ms")
print(f"CPU Usage              : {cpu_usage_condition_pandas:.2f}%")
print(f"Throughput             : {throughput_condition_pandas:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_condition_pandas / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_condition_pandas / 1e6:.2f} MB")
print("=================================================")


✅ 'Condition' column cleaned successfully
     Condition
0  Refurbished
1         Used
2  Refurbished

Total rows processed   : 174150
Wall Time (Elapsed)    : 1.2767 seconds
CPU Time               : 1375 ms
CPU Usage              : 8.50%
Throughput             : 136404.88 rows/sec
Current Memory (Python): 9.13 MB
Peak Memory (Python)   : 16.64 MB


In [33]:
## Part 3: Clean Sales Channel 

# Setup
process_saleschannel_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_saleschannel_pandas = time.time()
start_cpu_time_saleschannel_pandas = psutil.cpu_times().user
start_cpu_percent_saleschannel_pandas = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Sales Channel' Column ===
total_rows_saleschannel_pandas = df_pandas.shape[0]

# Extract Sales Channel (either 'Sales Agent' or 'Dealer'); values starting with neither become NaN
df_pandas['Sales Channel'] = df_pandas['Sales Channel'].str.extract(r'^(Sales Agent|Dealer)', expand=False)

# Record end time, memory, and CPU usage
end_time_saleschannel_pandas = time.time()
end_cpu_time_saleschannel_pandas = psutil.cpu_times().user
current_mem_saleschannel_pandas, peak_mem_saleschannel_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_saleschannel_pandas = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_saleschannel_pandas = end_time_saleschannel_pandas - start_time_saleschannel_pandas
cpu_time_saleschannel_pandas = (end_cpu_time_saleschannel_pandas - start_cpu_time_saleschannel_pandas) * 1000
throughput_saleschannel_pandas = total_rows_saleschannel_pandas / elapsed_time_saleschannel_pandas if elapsed_time_saleschannel_pandas > 0 else 0

# Output preview
print("✅ 'Sales Channel' column cleaned successfully")
print(df_pandas[['Sales Channel']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_saleschannel_pandas}")
print(f"Wall Time (Elapsed)    : {elapsed_time_saleschannel_pandas:.4f} seconds")
print(f"CPU Time               : {cpu_time_saleschannel_pandas:.0f} ms")
print(f"CPU Usage              : {cpu_usage_saleschannel_pandas:.2f}%")
print(f"Throughput             : {throughput_saleschannel_pandas:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_saleschannel_pandas / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_saleschannel_pandas / 1e6:.2f} MB")
print("=================================================")

✅ 'Sales Channel' column cleaned successfully
  Sales Channel
0   Sales Agent
1   Sales Agent
2   Sales Agent

Total rows processed   : 174150
Wall Time (Elapsed)    : 1.2360 seconds
CPU Time               : 1578 ms
CPU Usage              : 7.30%
Throughput             : 140902.18 rows/sec
Current Memory (Python): 9.84 MB
Peak Memory (Python)   : 27.68 MB


In [34]:
## Part 4: Clean Mileage 

# Setup
process_mileage_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_mileage_pandas = time.time()
start_cpu_time_mileage_pandas = psutil.cpu_times().user
start_cpu_percent_mileage_pandas = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Mileage' Column ===
total_rows_mileage_pandas = df_pandas.shape[0]

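# Remove "K KM" and clean mileage into 5K step ranges
# Note: this def rebinds the name process_mileage_pandas, which held the psutil handle created above
def process_mileage_pandas(mileage):
    # Missing or empty values stay missing
    if not isinstance(mileage, str) or mileage.strip() == '':
        return None
    try:
        if '-' in mileage:
            # Already a range such as "5 - 10": normalise the spacing
            start, end = mileage.split('-')
            return f"{int(start.strip())} - {int(end.strip())}"
        else:
            # Single value: place it in its 5K-wide bin, e.g. 12 -> "10 - 15"
            val = int(mileage.strip())
            lower = (val // 5) * 5
            upper = lower + 5
            return f"{lower} - {upper}"
    except ValueError:
        # Non-numeric text or a malformed range
        return None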

df_pandas['Mileage'] = df_pandas['Mileage'].str.replace('K KM', '', regex=False)
df_pandas['Mileage'] = df_pandas['Mileage'].apply(process_mileage_pandas)

# Record end time, memory, and CPU usage
end_time_mileage_pandas = time.time()
end_cpu_time_mileage_pandas = psutil.cpu_times().user
current_mem_mileage_pandas, peak_mem_mileage_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the step has finished
cpu_usage_mileage_pandas = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_mileage_pandas = end_time_mileage_pandas - start_time_mileage_pandas
cpu_time_mileage_pandas = (end_cpu_time_mileage_pandas - start_cpu_time_mileage_pandas) * 1000
throughput_mileage_pandas = total_rows_mileage_pandas / elapsed_time_mileage_pandas if elapsed_time_mileage_pandas > 0 else 0

# Output preview
print("✅ 'Mileage' column cleaned successfully")
print(df_pandas[['Mileage']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_mileage_pandas}")
print(f"Wall Time (Elapsed)    : {elapsed_time_mileage_pandas:.4f} seconds")
print(f"CPU Time               : {cpu_time_mileage_pandas:.0f} ms")
print(f"CPU Usage              : {cpu_usage_mileage_pandas:.2f}%")
print(f"Throughput             : {throughput_mileage_pandas:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_mileage_pandas / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_mileage_pandas / 1e6:.2f} MB")
print("=================================================")


✅ 'Mileage' column cleaned successfully
     Mileage
0     5 - 10
1  115 - 120
2    20 - 25

Total rows processed   : 174150
Wall Time (Elapsed)    : 2.5527 seconds
CPU Time               : 2719 ms
CPU Usage              : 11.90%
Throughput             : 68222.64 rows/sec
Current Memory (Python): 9.75 MB
Peak Memory (Python)   : 26.75 MB
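
To make the binning rule concrete, here is a hedged, standalone sketch of the same logic on plain strings. `bin_mileage` is a hypothetical name. Unlike the cell above, it strips the `K KM` suffix inside the function, and the single-value input `12K KM` is an assumed example that does not appear in the previews.

```python
def bin_mileage(raw, step=5):
    """Normalise a mileage label (in thousands of km) to a 'low - high' range.

    'a - bK KM' keeps its bounds; a single value is placed in its step-wide bin.
    Returns None for missing or unparseable input.
    """
    if not isinstance(raw, str):
        return None
    text = raw.replace("K KM", "").strip()
    if not text:
        return None
    try:
        if "-" in text:
            low, high = (int(part.strip()) for part in text.split("-"))
            return f"{low} - {high}"
        value = int(text)
        low = (value // step) * step
        return f"{low} - {low + step}"
    except ValueError:
        # Non-numeric text, or a range with more than two parts
        return None


for sample in ["5 - 10K KM ", "115 - 120K KM", "12K KM", "", None]:
    print(repr(sample), "->", bin_mileage(sample))
```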


In [35]:
# Display dataframe
df_pandas.head()

Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10,Automatic,Black,375000,4862,Refurbished,"Selangor, Klang",Sales Agent,5
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120,Automatic,White,55999,726,Used,"Selangor, Kajang",Sales Agent,7
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25,Automatic,Grey,662222,8585,Refurbished,"Selangor, Klang",Sales Agent,4
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95,Automatic,Silver,67000,869,Used,"Johor, Ulu Tiram",Sales Agent,5
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85,Automatic,White,98999,1283,Used,"Selangor, Kajang",Sales Agent,5


**2. Polars**

In [36]:
## Part 1: Clean Installment column 

# Setup
process_cleaninstallment_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_cleaninstallment_polars = time.time()
start_cpu_time_cleaninstallment_polars = psutil.cpu_times().user
start_cpu_percent_cleaninstallment_polars = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Installment' Column ===
total_rows_cleaninstallment_polars = df_polars.height

# Clean the 'Installment' column: remove 'RM', ',', '/month', strip spaces, and convert to int
df_polars = df_polars.with_columns(
    pl.col("Installment")
    .str.replace_all("RM", "")
    .str.replace_all(",", "")
    .str.replace_all("/month", "")
    .str.strip_chars()
    .cast(pl.Int64)
)

# Record end time, memory, and CPU usage
end_time_cleaninstallment_polars = time.time()
end_cpu_time_cleaninstallment_polars = psutil.cpu_times().user
current_mem_cleaninstallment_polars, peak_mem_cleaninstallment_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Performance metrics
cpu_usage_cleaninstallment_polars = psutil.cpu_percent(interval=1)
elapsed_time_cleaninstallment_polars = end_time_cleaninstallment_polars - start_time_cleaninstallment_polars
cpu_time_cleaninstallment_polars = (end_cpu_time_cleaninstallment_polars - start_cpu_time_cleaninstallment_polars) * 1000
throughput_cleaninstallment_polars = total_rows_cleaninstallment_polars / elapsed_time_cleaninstallment_polars if elapsed_time_cleaninstallment_polars > 0 else 0

# Output preview
print("✅ 'Installment' column cleaned successfully")
print(df_polars.select("Installment").head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_cleaninstallment_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_cleaninstallment_polars:.4f} seconds")
print(f"CPU Time               : {cpu_time_cleaninstallment_polars:.0f} ms")
print(f"CPU Usage              : {cpu_usage_cleaninstallment_polars:.2f}%")
print(f"Throughput             : {throughput_cleaninstallment_polars:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_cleaninstallment_polars / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_cleaninstallment_polars / 1e6:.2f} MB")
print("=================================================")


✅ 'Installment' column cleaned successfully
shape: (3, 1)
┌─────────────┐
│ Installment │
│ ---         │
│ i64         │
╞═════════════╡
│ 4862        │
│ 726         │
│ 8585        │
└─────────────┘

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.0556 seconds
CPU Time               : 94 ms
CPU Usage              : 6.90%
Throughput             : 3129715.50 rows/sec
Current Memory (Python): 0.02 MB
Peak Memory (Python)   : 0.04 MB
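
Every timed cell in this section repeats the same setup and summary code. The block below is a minimal sketch of how that boilerplate could be folded into one context manager. `measure` is a hypothetical helper and not part of the notebook. It assumes per-process CPU time is the quantity of interest and so uses `time.process_time()`, whereas the cells read `psutil.cpu_times().user`, which is a system-wide counter. Note also that `tracemalloc` only sees allocations made through Python's allocator, which is probably why the Polars cells report almost no memory.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, total_rows):
    """Hypothetical helper: time the enclosed block and print a short summary."""
    stats = {}
    tracemalloc.start()
    start_wall = time.perf_counter()   # monotonic wall clock
    start_cpu = time.process_time()    # CPU time of this process only
    try:
        yield stats
    finally:
        stats["wall_s"] = time.perf_counter() - start_wall
        stats["cpu_ms"] = (time.process_time() - start_cpu) * 1000
        stats["current_mb"], stats["peak_mb"] = (
            value / 1e6 for value in tracemalloc.get_traced_memory()
        )
        tracemalloc.stop()
        stats["rows_per_s"] = (
            total_rows / stats["wall_s"] if stats["wall_s"] > 0 else 0.0
        )
        print(f"[{label}] rows={total_rows} wall={stats['wall_s']:.4f}s "
              f"cpu={stats['cpu_ms']:.0f}ms peak={stats['peak_mb']:.2f}MB")


# Assumed sample strings; the raw 'Installment' format is not shown in this section.
raw = ["RM 4,862/month", "RM 726/month", "RM 8,585/month"]

with measure("installment-demo", total_rows=len(raw)) as stats:
    cleaned = [
        int(s.replace("RM", "").replace(",", "").replace("/month", "").strip())
        for s in raw
    ]
```

In a real cell the list comprehension would be replaced by the library call being benchmarked, and `stats` could be collected into a table for the final comparison.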


In [37]:
## Part 2: Extract Clean Values from Condition column

# Setup
process_condition_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_condition_polars = time.time()
start_cpu_time_condition_polars = psutil.cpu_times().user
start_cpu_percent_condition_polars = psutil.cpu_percent(interval=None)

# === Data Cleaning: Extract Clean Values from 'Condition' Column ===
total_rows_condition_polars = df_polars.height

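# Vectorized: extract the word between a '/' and 'Condition' (e.g. '.../UsedCondition' → 'Used').
# Values without a '/' fall back to stripping a trailing 'Condition' (e.g. 'NewCondition' → 'New').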
df_polars = df_polars.with_columns(
    pl.col("Condition")
    .str.extract(r"/([A-Za-z]+)Condition", 1)
    .fill_null(
        pl.col("Condition").str.replace(r"Condition$", "", literal=False)
    )
    .alias("Condition")
)

# Record end time, memory, and CPU usage
end_time_condition_polars = time.time()
end_cpu_time_condition_polars = psutil.cpu_times().user
current_mem_condition_polars, peak_mem_condition_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Performance metrics
cpu_usage_condition_polars = psutil.cpu_percent(interval=1)
elapsed_time_condition_polars = end_time_condition_polars - start_time_condition_polars
cpu_time_condition_polars = (end_cpu_time_condition_polars - start_cpu_time_condition_polars) * 1000
throughput_condition_polars = total_rows_condition_polars / elapsed_time_condition_polars if elapsed_time_condition_polars > 0 else 0

# Output preview
print("✅ 'Condition' column cleaned successfully")
print(df_polars.select("Condition").head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_condition_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_condition_polars:.4f} seconds")
print(f"CPU Time               : {cpu_time_condition_polars:.0f} ms")
print(f"CPU Usage              : {cpu_usage_condition_polars:.2f}%")
print(f"Throughput             : {throughput_condition_polars:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_condition_polars / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_condition_polars / 1e6:.2f} MB")
print("=================================================")


✅ 'Condition' column cleaned successfully
shape: (3, 1)
┌─────────────┐
│ Condition   │
│ ---         │
│ str         │
╞═════════════╡
│ Refurbished │
│ Used        │
│ Refurbished │
└─────────────┘

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.0710 seconds
CPU Time               : 62 ms
CPU Usage              : 11.00%
Throughput             : 2453134.74 rows/sec
Current Memory (Python): 0.02 MB
Peak Memory (Python)   : 0.29 MB
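
The two-step rule used in the cell above (extract the captured group first, then fall back to stripping a trailing `Condition`) can be checked on individual strings with the standard `re` module. This is only a sketch of the same logic in plain Python. The function name `clean_condition` and the sample inputs are assumptions, since the raw column values are not shown in this section.

```python
import re

# Same patterns as the Polars expression above
SCHEMA_RE = re.compile(r"/([A-Za-z]+)Condition")
SUFFIX_RE = re.compile(r"Condition$")


def clean_condition(value):
    """Hypothetical per-value equivalent of the extract + fill_null expression."""
    if value is None:
        return None
    match = SCHEMA_RE.search(value)
    if match:
        # Path 1: a '/XCondition' fragment was found, keep X
        return match.group(1)
    # Path 2: no '/' fragment, only drop a trailing 'Condition'
    return SUFFIX_RE.sub("", value)


# Assumed sample inputs covering both paths and an already clean value
samples = ["https://schema.org/UsedCondition", "NewCondition", "Refurbished", None]
results = [clean_condition(v) for v in samples]
print(results)
```

Running a handful of representative raw values through such a function before cleaning the full column is a cheap way to confirm the regular expressions behave as intended.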


In [38]:
## Part 3: Clean Sales Channel 

# Setup
process_saleschannel_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_saleschannel_polars = time.time()
start_cpu_time_saleschannel_polars = psutil.cpu_times().user
start_cpu_percent_saleschannel_polars = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Sales Channel' Column ===
total_rows_saleschannel_polars = df_polars.height

# Extract either "Sales Agent" or "Dealer" using regex
df_polars = df_polars.with_columns(
    pl.col("Sales Channel")
    .str.extract(r"^(Sales Agent|Dealer)", 1)
    .alias("Sales Channel")
)

# Record end time, memory, and CPU usage
end_time_saleschannel_polars = time.time()
end_cpu_time_saleschannel_polars = psutil.cpu_times().user
current_mem_saleschannel_polars, peak_mem_saleschannel_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Performance metrics
cpu_usage_saleschannel_polars = psutil.cpu_percent(interval=1)
elapsed_time_saleschannel_polars = end_time_saleschannel_polars - start_time_saleschannel_polars
cpu_time_saleschannel_polars = (end_cpu_time_saleschannel_polars - start_cpu_time_saleschannel_polars) * 1000
throughput_saleschannel_polars = total_rows_saleschannel_polars / elapsed_time_saleschannel_polars if elapsed_time_saleschannel_polars > 0 else 0

# Output preview
print("✅ 'Sales Channel' column cleaned successfully")
print(df_polars.select("Sales Channel").head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_saleschannel_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_saleschannel_polars:.4f} seconds")
print(f"CPU Time               : {cpu_time_saleschannel_polars:.0f} ms")
print(f"CPU Usage              : {cpu_usage_saleschannel_polars:.2f}%")
print(f"Throughput             : {throughput_saleschannel_polars:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_saleschannel_polars / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_saleschannel_polars / 1e6:.2f} MB")
print("=================================================")


✅ 'Sales Channel' column cleaned successfully
shape: (3, 1)
┌───────────────┐
│ Sales Channel │
│ ---           │
│ str           │
╞═══════════════╡
│ Sales Agent   │
│ Sales Agent   │
│ Sales Agent   │
└───────────────┘

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.0217 seconds
CPU Time               : 16 ms
CPU Usage              : 9.50%
Throughput             : 8010594.42 rows/sec
Current Memory (Python): 0.02 MB
Peak Memory (Python)   : 0.04 MB
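
Because the pattern `^(Sales Agent|Dealer)` is anchored at the start of the string, any value that begins with something else is turned into a null by `str.extract`. The sketch below reproduces that behaviour with the standard `re` module so the unmatched values can be inspected. The helper name and the sample labels are assumptions made for illustration.

```python
import re

CHANNEL_RE = re.compile(r"^(Sales Agent|Dealer)")


def extract_channel(value):
    """Return the matched channel, or None when the value starts with anything else."""
    if value is None:
        return None
    match = CHANNEL_RE.match(value)
    return match.group(1) if match else None


# Assumed sample labels; only 'Sales Agent' appears in the previews of this section
samples = ["Sales Agent", "Dealer - Verified", "Private Seller", None]
cleaned = [extract_channel(v) for v in samples]

# Raw values that were present but silently became None
unmatched = [raw for raw, out in zip(samples, cleaned) if raw is not None and out is None]
print(cleaned, unmatched)
```

Counting nulls in the column before and after the extraction gives the same check on the full dataset.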


In [39]:
## Part 4: Clean Mileage

# Setup
process_mileage_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_mileage_polars = time.time()
start_cpu_time_mileage_polars = psutil.cpu_times().user
start_cpu_percent_mileage_polars = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Mileage' Column ===
total_rows_mileage_polars = df_polars.height

# Define a function to clean mileage and convert into 5K step range
def clean_mileage(val):
    if val is None or str(val).strip() == '':
        return None
    val = str(val).replace('K KM', '').strip()
    try:
        if '-' in val:
            start, end = val.split('-')
            return f"{int(start.strip())} - {int(end.strip())}"
        else:
            num = int(val)
            lower = (num // 5) * 5
            upper = lower + 5
            return f"{lower} - {upper}"
    except ValueError:  # non-numeric text or a malformed range
        return None

# Apply the cleaning using map_elements
df_polars = df_polars.with_columns(
    pl.col("Mileage")
    .map_elements(clean_mileage, return_dtype=pl.Utf8)
    .alias("Mileage")
)

# Record end time, memory, and CPU usage
end_time_mileage_polars = time.time()
end_cpu_time_mileage_polars = psutil.cpu_times().user
current_mem_mileage_polars, peak_mem_mileage_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Performance metrics
cpu_usage_mileage_polars = psutil.cpu_percent(interval=1)
elapsed_time_mileage_polars = end_time_mileage_polars - start_time_mileage_polars
cpu_time_mileage_polars = (end_cpu_time_mileage_polars - start_cpu_time_mileage_polars) * 1000
throughput_mileage_polars = total_rows_mileage_polars / elapsed_time_mileage_polars if elapsed_time_mileage_polars > 0 else 0

# Output preview
print("✅ 'Mileage' column cleaned successfully")
print(df_polars.select("Mileage").head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_mileage_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_mileage_polars:.4f} seconds")
print(f"CPU Time               : {cpu_time_mileage_polars:.0f} ms")
print(f"CPU Usage              : {cpu_usage_mileage_polars:.2f}%")
print(f"Throughput             : {throughput_mileage_polars:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_mileage_polars / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_mileage_polars / 1e6:.2f} MB")
print("=================================================")


✅ 'Mileage' column cleaned successfully
shape: (3, 1)
┌───────────┐
│ Mileage   │
│ ---       │
│ str       │
╞═══════════╡
│ 5 - 10    │
│ 115 - 120 │
│ 20 - 25   │
└───────────┘

Total rows processed   : 174150
Wall Time (Elapsed)    : 2.5342 seconds
CPU Time               : 2922 ms
CPU Usage              : 12.50%
Throughput             : 68719.17 rows/sec
Current Memory (Python): 0.28 MB
Peak Memory (Python)   : 0.54 MB
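
The Mileage step takes about 2.53 s in Polars, almost the same as the 2.55 s measured for pandas, most likely because `map_elements` calls the Python function once per row and so bypasses the native engine. One possible mitigation, sketched below, is to memoise the function with `functools.lru_cache` so that each distinct raw string is parsed only once. This assumes the column holds relatively few distinct values, and `clean_mileage_cached` is a hypothetical name. A version written purely with native Polars expressions would probably be faster still.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def clean_mileage_cached(val):
    """Same bucketing rule as clean_mileage above, memoised per distinct input."""
    if val is None or str(val).strip() == "":
        return None
    text = str(val).replace("K KM", "").strip()
    try:
        if "-" in text:
            # Already a range: normalise the spacing
            start, end = text.split("-")
            return f"{int(start.strip())} - {int(end.strip())}"
        # Single value: round down to the nearest multiple of 5
        lower = (int(text) // 5) * 5
        return f"{lower} - {lower + 5}"
    except ValueError:
        return None


# Assumed raw values; repeated strings are served from the cache
raw = ["7K KM", "115 - 120K KM", "7K KM", "", "n/a", "7K KM"]
cleaned = [clean_mileage_cached(v) for v in raw]
info = clean_mileage_cached.cache_info()
print(cleaned, info)
```

The cached function can be passed to `map_elements` in place of `clean_mileage`. Whether it helps in practice depends on how many distinct mileage strings the dataset contains.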


In [40]:
df_polars.head()

| Car Name | Car Brand | Car Model | Manufacture Year | Body Type | Fuel Type | Mileage | Transmission | Color | Price | Installment | Condition | Location | Sales Channel | Seat Capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | i64 | str | str | str | str | str | i64 | i64 | str | str | str | str |
| "2023 Lexus RX350 2.4 F Sport S…" | "Lexus" | "RX350" | 2023 | "SUV" | "Petrol - Unleaded (ULP)" | "5 - 10" | "Automatic" | "Black" | 375000 | 4862 | "Refurbished" | "Selangor, Klang " | "Sales Agent" | "5" |
| "2010 Toyota Estima 2.4 Aeras M…" | "Toyota" | "Estima" | 2010 | "MPV" | "Petrol - Unleaded (ULP)" | "115 - 120" | "Automatic" | "White" | 55999 | 726 | "Used" | "Selangor, Kajang " | "Sales Agent" | "7" |
| "2020 Porsche Cayenne Coupe 4.0…" | "Porsche" | "Cayenne" | 2020 | "Coupe" | "Petrol - Unleaded (ULP)" | "20 - 25" | "Automatic" | "Grey" | 662222 | 8585 | "Refurbished" | "Selangor, Klang " | "Sales Agent" | "4" |
| "2021 Honda City 1.5 V i-VTEC H…" | "Honda" | "City" | 2021 | "Hatchback" | "Petrol - Unleaded (ULP)" | "90 - 95" | "Automatic" | "Silver" | 67000 | 869 | "Used" | "Johor, Ulu Tiram " | "Sales Agent" | "5" |
| "2022 Toyota Corolla Cross 1.8 …" | "Toyota" | "Corolla Cross" | 2022 | "SUV" | "Petrol - Unleaded (ULP)" | "80 - 85" | "Automatic" | "White" | 98999 | 1283 | "Used" | "Selangor, Kajang " | "Sales Agent" | "5" |


**3. Modin**

In [41]:
## Part 1: Clean Installment column 

# Setup
process_cleaninstallment_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_cleaninstallment_modin = time.time()
start_cpu_time_cleaninstallment_modin = psutil.cpu_times().user
start_cpu_percent_cleaninstallment_modin = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Installment' Column ===
total_rows_cleaninstallment_modin = df_modin.shape[0]

# Cleaning operations
df_modin['Installment'] = df_modin['Installment'].str.replace('RM', '', regex=False)
df_modin['Installment'] = df_modin['Installment'].str.replace(',', '', regex=False)
df_modin['Installment'] = df_modin['Installment'].str.replace('/month', '', regex=False)
df_modin['Installment'] = pd.to_numeric(df_modin['Installment'], errors='coerce')

# Record end time, memory, and CPU usage
end_time_cleaninstallment_modin = time.time()
end_cpu_time_cleaninstallment_modin = psutil.cpu_times().user
current_mem_cleaninstallment_modin, peak_mem_cleaninstallment_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over a 1-second window after the operation
cpu_usage_cleaninstallment_modin = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_cleaninstallment_modin = end_time_cleaninstallment_modin - start_time_cleaninstallment_modin
cpu_time_cleaninstallment_modin = (end_cpu_time_cleaninstallment_modin - start_cpu_time_cleaninstallment_modin) * 1000
throughput_cleaninstallment_modin = total_rows_cleaninstallment_modin / elapsed_time_cleaninstallment_modin if elapsed_time_cleaninstallment_modin > 0 else 0

# Output preview
print("✅ 'Installment' column cleaned successfully")
print(df_modin[['Installment']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_cleaninstallment_modin}")
print(f"Wall Time (Elapsed)    : {elapsed_time_cleaninstallment_modin:.4f} seconds")
print(f"CPU Time               : {cpu_time_cleaninstallment_modin:.0f} ms")
print(f"CPU Usage              : {cpu_usage_cleaninstallment_modin:.2f}%")
print(f"Throughput             : {throughput_cleaninstallment_modin:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_cleaninstallment_modin / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_cleaninstallment_modin / 1e6:.2f} MB")
print("=================================================")

✅ 'Installment' column cleaned successfully
   Installment
0         4862
1          726
2         8585

Total rows processed   : 174150
Wall Time (Elapsed)    : 3.9595 seconds
CPU Time               : 5094 ms
CPU Usage              : 5.50%
Throughput             : 43983.06 rows/sec
Current Memory (Python): 3.11 MB
Peak Memory (Python)   : 19.68 MB
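
The three chained `str.replace` calls each make a separate pass over the column. As a sketch of an alternative, the same markers can be removed in one pass with a single regular expression, followed by a conversion that maps invalid input to NaN in the spirit of `errors='coerce'`. The example uses only the standard library. The helper name `to_number` and the sample strings are assumptions.

```python
import math
import re

# One pattern covering all three markers removed in the cell above
UNWANTED = re.compile(r"RM|,|/month")


def to_number(text):
    """Strip currency markers, then convert; invalid input becomes NaN."""
    if not isinstance(text, str):
        return math.nan
    stripped = UNWANTED.sub("", text).strip()
    try:
        return int(stripped)
    except ValueError:
        return math.nan


# Assumed raw strings, including one that cannot be converted
samples = ["RM 4,862/month", "RM 726/month", "Call for price"]
values = [to_number(s) for s in samples]
print(values)
```

With pandas-style APIs the equivalent would be one `str.replace` call using the same pattern and `regex=True`. Whether that is actually faster under Modin would need to be measured.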


In [42]:
## Part 2: Extract Clean Values from Condition column
# Setup
process_condition_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_condition_modin = time.time()
start_cpu_time_condition_modin = psutil.cpu_times().user
start_cpu_percent_condition_modin = psutil.cpu_percent(interval=None)

# === Data Cleaning: Extract Clean Values from 'Condition' Column ===
total_rows_condition_modin = df_modin.shape[0]

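# First: Extract value from '...schema.org/XCondition'
def extract_condition_modin(x):
    # Leave non-strings and strings without a schema.org URL unchanged
    if not isinstance(x, str) or 'schema.org/' not in x:
        return x
    match = re.search(r'\.org/([A-Za-z]+)Condition', x)
    # Guard against URLs that do not end in '...Condition' instead of raising AttributeError
    return match.group(1) if match else x

df_modin['Condition'] = df_modin['Condition'].apply(extract_condition_modin)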

# Second: Strip 'Condition' if it remains (for post-processed cases)
df_modin['Condition'] = df_modin['Condition'].str.replace(r'Condition$', '', regex=True)

# Record end time, memory, and CPU usage
end_time_condition_modin = time.time()
end_cpu_time_condition_modin = psutil.cpu_times().user
current_mem_condition_modin, peak_mem_condition_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over a 1-second window after the operation
cpu_usage_condition_modin = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_condition_modin = end_time_condition_modin - start_time_condition_modin
cpu_time_condition_modin = (end_cpu_time_condition_modin - start_cpu_time_condition_modin) * 1000
throughput_condition_modin = total_rows_condition_modin / elapsed_time_condition_modin if elapsed_time_condition_modin > 0 else 0

# Output preview
print("✅ 'Condition' column cleaned successfully")
print(df_modin[['Condition']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_condition_modin}")
print(f"Wall Time (Elapsed)    : {elapsed_time_condition_modin:.4f} seconds")
print(f"CPU Time               : {cpu_time_condition_modin:.0f} ms")
print(f"CPU Usage              : {cpu_usage_condition_modin:.2f}%")
print(f"Throughput             : {throughput_condition_modin:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_condition_modin / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_condition_modin / 1e6:.2f} MB")
print("=================================================")


✅ 'Condition' column cleaned successfully
     Condition
0  Refurbished
1         Used
2  Refurbished

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.7658 seconds
CPU Time               : 1391 ms
CPU Usage              : 9.50%
Throughput             : 227422.13 rows/sec
Current Memory (Python): 1.31 MB
Peak Memory (Python)   : 1.75 MB


In [43]:
## Part 3: Clean Sales Channel 

# Setup
process_saleschannel_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_saleschannel_modin = time.time()
start_cpu_time_saleschannel_modin = psutil.cpu_times().user
start_cpu_percent_saleschannel_modin = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Sales Channel' Column ===
total_rows_saleschannel_modin = df_modin.shape[0]

# Extract Sales Channel (either 'Sales Agent' or 'Dealer')
df_modin['Sales Channel'] = df_modin['Sales Channel'].str.extract(r'^(Sales Agent|Dealer)')

# Record end time, memory, and CPU usage
end_time_saleschannel_modin = time.time()
end_cpu_time_saleschannel_modin = psutil.cpu_times().user
current_mem_saleschannel_modin, peak_mem_saleschannel_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over a 1-second window after the operation
cpu_usage_saleschannel_modin = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_saleschannel_modin = end_time_saleschannel_modin - start_time_saleschannel_modin
cpu_time_saleschannel_modin = (end_cpu_time_saleschannel_modin - start_cpu_time_saleschannel_modin) * 1000
throughput_saleschannel_modin = total_rows_saleschannel_modin / elapsed_time_saleschannel_modin if elapsed_time_saleschannel_modin > 0 else 0

# Output preview
print("✅ 'Sales Channel' column cleaned successfully")
print(df_modin[['Sales Channel']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_saleschannel_modin}")
print(f"Wall Time (Elapsed)    : {elapsed_time_saleschannel_modin:.4f} seconds")
print(f"CPU Time               : {cpu_time_saleschannel_modin:.0f} ms")
print(f"CPU Usage              : {cpu_usage_saleschannel_modin:.2f}%")
print(f"Throughput             : {throughput_saleschannel_modin:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_saleschannel_modin / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_saleschannel_modin / 1e6:.2f} MB")
print("=================================================")

✅ 'Sales Channel' column cleaned successfully
  Sales Channel
0   Sales Agent
1   Sales Agent
2   Sales Agent

Total rows processed   : 174150
Wall Time (Elapsed)    : 1.7670 seconds
CPU Time               : 2922 ms
CPU Usage              : 12.50%
Throughput             : 98556.84 rows/sec
Current Memory (Python): 1.67 MB
Peak Memory (Python)   : 19.02 MB


In [44]:
## Part 4: Clean Mileage

# Setup
process_mileage_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_mileage_modin = time.time()
start_cpu_time_mileage_modin = psutil.cpu_times().user
start_cpu_percent_mileage_modin = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Mileage' Column ===
total_rows_mileage_modin = df_modin.shape[0]

# Remove "K KM" and clean mileage into 5K step ranges
def process_mileage_modin(mileage):
    if not isinstance(mileage, str) or mileage.strip() == '':
        return None
    try:
        if '-' in mileage:
            start, end = mileage.split('-')
            return f"{int(start.strip())} - {int(end.strip())}"
        else:
            val = int(mileage.strip())
            lower = (val // 5) * 5
            upper = lower + 5
            return f"{lower} - {upper}"
    except ValueError:  # non-numeric text or a malformed range
        return None

df_modin['Mileage'] = df_modin['Mileage'].str.replace('K KM', '', regex=False)
df_modin['Mileage'] = df_modin['Mileage'].apply(process_mileage_modin)

# Record end time, memory, and CPU usage
end_time_mileage_modin = time.time()
end_cpu_time_mileage_modin = psutil.cpu_times().user
current_mem_mileage_modin, peak_mem_mileage_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over a 1-second window after the operation
cpu_usage_mileage_modin = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_mileage_modin = end_time_mileage_modin - start_time_mileage_modin
cpu_time_mileage_modin = (end_cpu_time_mileage_modin - start_cpu_time_mileage_modin) * 1000
throughput_mileage_modin = total_rows_mileage_modin / elapsed_time_mileage_modin if elapsed_time_mileage_modin > 0 else 0

# Output preview
print("✅ 'Mileage' column cleaned successfully")
print(df_modin[['Mileage']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_mileage_modin}")
print(f"Wall Time (Elapsed)    : {elapsed_time_mileage_modin:.4f} seconds")
print(f"CPU Time               : {cpu_time_mileage_modin:.0f} ms")
print(f"CPU Usage              : {cpu_usage_mileage_modin:.2f}%")
print(f"Throughput             : {throughput_mileage_modin:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_mileage_modin / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_mileage_modin / 1e6:.2f} MB")
print("=================================================")


✅ 'Mileage' column cleaned successfully
     Mileage
0     5 - 10
1  115 - 120
2    20 - 25

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.8994 seconds
CPU Time               : 2109 ms
CPU Usage              : 20.60%
Throughput             : 193626.21 rows/sec
Current Memory (Python): 1.28 MB
Peak Memory (Python)   : 1.89 MB


In [45]:
df_modin.head()

| | Car Name | Car Brand | Car Model | Manufacture Year | Body Type | Fuel Type | Mileage | Transmission | Color | Price | Installment | Condition | Location | Sales Channel | Seat Capacity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 Lexus RX350 2.4 F Sport SUV | Lexus | RX350 | 2023 | SUV | Petrol - Unleaded (ULP) | 5 - 10 | Automatic | Black | 375000 | 4862 | Refurbished | Selangor, Klang | Sales Agent | 5 |
| 1 | 2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I... | Toyota | Estima | 2010 | MPV | Petrol - Unleaded (ULP) | 115 - 120 | Automatic | White | 55999 | 726 | Used | Selangor, Kajang | Sales Agent | 7 |
| 2 | 2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un... | Porsche | Cayenne | 2020 | Coupe | Petrol - Unleaded (ULP) | 20 - 25 | Automatic | Grey | 662222 | 8585 | Refurbished | Selangor, Klang | Sales Agent | 4 |
| 3 | 2021 Honda City 1.5 V i-VTEC Hatchback | Honda | City | 2021 | Hatchback | Petrol - Unleaded (ULP) | 90 - 95 | Automatic | Silver | 67000 | 869 | Used | Johor, Ulu Tiram | Sales Agent | 5 |
| 4 | 2022 Toyota Corolla Cross 1.8 V SUV Full Servi... | Toyota | Corolla Cross | 2022 | SUV | Petrol - Unleaded (ULP) | 80 - 85 | Automatic | White | 98999 | 1283 | Used | Selangor, Kajang | Sales Agent | 5 |


**4. Dask**

In [46]:
## Part 1: Clean Installment column 

# Setup
process_cleaninstallment_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_cleaninstallment_dask = time.perf_counter()
start_cpu_time_cleaninstallment_dask = psutil.cpu_times().user
start_cpu_percent_cleaninstallment_dask = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Installment' Column ===
total_rows_cleaninstallment_dask = df_dask.shape[0].compute()

# Convert to string before applying str methods
# (Dask only records these steps here; they run later, when .compute() or .head() is called)
df_dask['Installment'] = df_dask['Installment'].astype(str)
df_dask['Installment'] = df_dask['Installment'].str.replace('RM', '', regex=False)
df_dask['Installment'] = df_dask['Installment'].str.replace(',', '', regex=False)
df_dask['Installment'] = df_dask['Installment'].str.replace('/month', '', regex=False)
df_dask['Installment'] = dd.to_numeric(df_dask['Installment'], errors='coerce')

# Record end time, memory, and CPU usage
end_time_cleaninstallment_dask = time.perf_counter()
end_cpu_time_cleaninstallment_dask = psutil.cpu_times().user
current_mem_cleaninstallment_dask, peak_mem_cleaninstallment_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over a 1-second window after the operation
cpu_usage_cleaninstallment_dask = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_cleaninstallment_dask = end_time_cleaninstallment_dask - start_time_cleaninstallment_dask
cpu_time_cleaninstallment_dask = (end_cpu_time_cleaninstallment_dask - start_cpu_time_cleaninstallment_dask) * 1000
throughput_cleaninstallment_dask = total_rows_cleaninstallment_dask / elapsed_time_cleaninstallment_dask if elapsed_time_cleaninstallment_dask > 0 else 0

# Output preview
print("✅ 'Installment' column cleaned successfully")
print(df_dask[['Installment']].head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed     : {total_rows_cleaninstallment_dask}")
print(f"Wall Time (Elapsed)      : {elapsed_time_cleaninstallment_dask:.4f} seconds")
print(f"CPU Time                 : {cpu_time_cleaninstallment_dask:.0f} ms")
print(f"CPU Usage                : {cpu_usage_cleaninstallment_dask:.2f}%")
print(f"Throughput               : {throughput_cleaninstallment_dask:.2f} rows/sec")
print(f"Current Memory (Python)  : {current_mem_cleaninstallment_dask / 1e6:.2f} MB")
print(f"Peak Memory (Python)     : {peak_mem_cleaninstallment_dask / 1e6:.2f} MB")
print("=================================================")


✅ 'Installment' column cleaned successfully


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


   Installment
0         4862
1          726
2         8585

Total rows processed     : 174150
Wall Time (Elapsed)      : 0.1376 seconds
CPU Time                 : 109 ms
CPU Usage                : 8.10%
Throughput               : 1265861.43 rows/sec
Current Memory (Python)  : 0.26 MB
Peak Memory (Python)     : 0.66 MB
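
Dask DataFrame operations are lazy: the `str.replace` and `to_numeric` calls in the cell above only extend the task graph, and the cleaning itself runs when `.head(3)` is called after the timer has stopped. The 0.1376 s reported here therefore probably reflects the row count and graph construction rather than the cleaning work. The sketch below illustrates the effect with a plain Python generator standing in for the task graph. It is an analogy under that assumption and does not use Dask itself. The sample strings are also assumed.

```python
import time


def lazy_clean(values):
    """Generator: no value is cleaned until the result is iterated."""
    for v in values:
        yield int(v.replace("RM", "").replace(",", "").replace("/month", "").strip())


# Assumed raw strings, repeated to give the timer something to measure
raw = ["RM 4,862/month", "RM 726/month", "RM 8,585/month"] * 50_000

t0 = time.perf_counter()
pipeline = lazy_clean(raw)            # only builds the pipeline
build_seconds = time.perf_counter() - t0

t1 = time.perf_counter()
cleaned = list(pipeline)              # forces the work, like .compute()
run_seconds = time.perf_counter() - t1

print(f"build: {build_seconds:.6f}s, run: {run_seconds:.6f}s, rows: {len(cleaned)}")
```

For a like-for-like comparison with pandas, Polars and Modin, the timed region of each Dask cell would need to include a call that materialises the cleaned column, for example `.compute()` on the result.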


In [47]:
## Part 2: Extract Clean Values from Condition column 

# Setup
process_condition_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_condition_dask = time.time()
start_cpu_time_condition_dask = psutil.cpu_times().user
start_cpu_percent_condition_dask = psutil.cpu_percent(interval=None)

# === Data Cleaning: Extract Clean Values from 'Condition' Column ===
total_rows_condition_dask = df_dask.shape[0].compute()

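# First: Extract value from '...schema.org/XCondition'
def extract_condition_dask(x):
    # Leave non-strings and strings without a schema.org URL unchanged
    if not isinstance(x, str) or 'schema.org/' not in x:
        return x
    match = re.search(r'\.org/([A-Za-z]+)Condition', x)
    # Guard against URLs that do not end in '...Condition' instead of raising AttributeError
    return match.group(1) if match else x

df_dask['Condition'] = df_dask['Condition'].apply(extract_condition_dask, meta=('Condition', 'object'))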

# Second: Strip 'Condition' if it remains (for post-processed cases)
df_dask['Condition'] = df_dask['Condition'].str.replace(r'Condition$', '', regex=True)

# Record end time, memory, and CPU usage
end_time_condition_dask = time.time()
end_cpu_time_condition_dask = psutil.cpu_times().user
current_mem_condition_dask, peak_mem_condition_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over a 1-second window after the operation
cpu_usage_condition_dask = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_condition_dask = end_time_condition_dask - start_time_condition_dask
cpu_time_condition_dask = (end_cpu_time_condition_dask - start_cpu_time_condition_dask) * 1000
throughput_condition_dask = total_rows_condition_dask / elapsed_time_condition_dask if elapsed_time_condition_dask > 0 else 0

# Output preview
print("✅ 'Condition' column cleaned successfully")
print(df_dask[['Condition']].compute().head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_condition_dask}")
print(f"Wall Time (Elapsed)    : {elapsed_time_condition_dask:.4f} seconds")
print(f"CPU Time               : {cpu_time_condition_dask:.0f} ms")
print(f"CPU Usage              : {cpu_usage_condition_dask:.2f}%")
print(f"Throughput             : {throughput_condition_dask:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_condition_dask / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_condition_dask / 1e6:.2f} MB")
print("=================================================")


✅ 'Condition' column cleaned successfully


     Condition
0  Refurbished
1         Used
2  Refurbished

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.3295 seconds
CPU Time               : 375 ms
CPU Usage              : 11.80%
Throughput             : 528492.96 rows/sec
Current Memory (Python): 0.28 MB
Peak Memory (Python)   : 0.78 MB


In [48]:
## Part 3: Clean Sales Channel

# Setup
process_saleschannel_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_saleschannel_dask = time.time()
start_cpu_time_saleschannel_dask = psutil.cpu_times().user
start_cpu_percent_saleschannel_dask = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Sales Channel' Column ===
total_rows_saleschannel_dask = df_dask.shape[0].compute()

# Extract Sales Channel (either 'Sales Agent' or 'Dealer')
df_dask['Sales Channel'] = df_dask['Sales Channel'].str.extract(r'^(Sales Agent|Dealer)', expand=False)

# Record end time, memory, and CPU usage
end_time_saleschannel_dask = time.time()
end_cpu_time_saleschannel_dask = psutil.cpu_times().user
current_mem_saleschannel_dask, peak_mem_saleschannel_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over a 1-second window after the operation
cpu_usage_saleschannel_dask = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_saleschannel_dask = end_time_saleschannel_dask - start_time_saleschannel_dask
cpu_time_saleschannel_dask = (end_cpu_time_saleschannel_dask - start_cpu_time_saleschannel_dask) * 1000
throughput_saleschannel_dask = total_rows_saleschannel_dask / elapsed_time_saleschannel_dask if elapsed_time_saleschannel_dask > 0 else 0

# Output preview
print("✅ 'Sales Channel' column cleaned successfully")
print(df_dask[['Sales Channel']].compute().head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_saleschannel_dask}")
print(f"Wall Time (Elapsed)    : {elapsed_time_saleschannel_dask:.4f} seconds")
print(f"CPU Time               : {cpu_time_saleschannel_dask:.0f} ms")
print(f"CPU Usage              : {cpu_usage_saleschannel_dask:.2f}%")
print(f"Throughput             : {throughput_saleschannel_dask:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_saleschannel_dask / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_saleschannel_dask / 1e6:.2f} MB")
print("=================================================")


✅ 'Sales Channel' column cleaned successfully


  Sales Channel
0   Sales Agent
1   Sales Agent
2   Sales Agent

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.3289 seconds
CPU Time               : 703 ms
CPU Usage              : 13.60%
Throughput             : 529517.81 rows/sec
Current Memory (Python): 0.20 MB
Peak Memory (Python)   : 0.72 MB
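
To make the extraction step above concrete, here is a minimal standard-library sketch of how the pattern `^(Sales Agent|Dealer)` behaves. The sample strings and the helper name `extract_sales_channel` are hypothetical and are not taken from the scraped data. The point is that any value that does not start with one of the two labels yields `None`, which later shows up as a missing value in the `Sales Channel` column.

```python
import re

# Same pattern as in the cleaning cell: anchored at the start of the string.
CHANNEL_PATTERN = re.compile(r"^(Sales Agent|Dealer)")


def extract_sales_channel(raw):
    """Return 'Sales Agent' or 'Dealer' if raw starts with one of them, else None."""
    if not isinstance(raw, str):
        return None
    match = CHANNEL_PATTERN.match(raw)
    return match.group(1) if match else None


# Hypothetical raw values, for illustration only.
samples = ["Sales Agent - John", "Dealer (Verified)", "Private Seller", "", None]
cleaned = [extract_sales_channel(s) for s in samples]
print(cleaned)
```

Values that fail to match are not dropped at this stage; they only become nulls, so they are still counted in the row total until the null-handling step.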


In [49]:
## Part 4: Clean Mileage 

# Setup
process_mileage_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_mileage_dask = time.time()
start_cpu_time_mileage_dask = psutil.cpu_times().user
start_cpu_percent_mileage_dask = psutil.cpu_percent(interval=None)

# === Data Cleaning: Clean 'Mileage' Column ===
total_rows_mileage_dask = df_dask.shape[0].compute()

# Remove "K KM"
df_dask['Mileage'] = df_dask['Mileage'].str.replace('K KM', '', regex=False)

# Clean and standardize mileage into 5K step ranges
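def bin_mileage_dask(mileage):
    # Missing or blank values stay missing
    if not isinstance(mileage, str) or mileage.strip() == '':
        return None
    try:
        if '-' in mileage:
            # Already a range such as "5 - 10": normalise the spacing only
            start, end = mileage.split('-')
            return f"{int(start.strip())} - {int(end.strip())}"
        else:
            # Single value: floor to a multiple of 5 to get the lower bound of the bin
            val = int(mileage.strip())
            lower = (val // 5) * 5
            upper = lower + 5
            return f"{lower} - {upper}"
    except ValueError:
        # Non-numeric text or a malformed range is treated as missing
        return None

df_dask['Mileage'] = df_dask['Mileage'].apply(bin_mileage_dask, meta=('Mileage', 'object'))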

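# Record end time, memory, and CPU usage
# Note: Dask is lazy, so the mileage cleaning above has only been added to the task graph;
# it actually runs in the .compute() call of the preview below, outside the timed window
end_time_mileage_dask = time.time()
end_cpu_time_mileage_dask = psutil.cpu_times().user
current_mem_mileage_dask, peak_mem_mileage_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the operation
cpu_usage_mileage_dask = psutil.cpu_percent(interval=1)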

# Calculate metrics
elapsed_time_mileage_dask = end_time_mileage_dask - start_time_mileage_dask
cpu_time_mileage_dask = (end_cpu_time_mileage_dask - start_cpu_time_mileage_dask) * 1000
throughput_mileage_dask = total_rows_mileage_dask / elapsed_time_mileage_dask if elapsed_time_mileage_dask > 0 else 0

# Output preview
print("✅ 'Mileage' column cleaned successfully")
print(df_dask[['Mileage']].compute().head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_mileage_dask}")
print(f"Wall Time (Elapsed)    : {elapsed_time_mileage_dask:.4f} seconds")
print(f"CPU Time               : {cpu_time_mileage_dask:.0f} ms")
print(f"CPU Usage              : {cpu_usage_mileage_dask:.2f}%")
print(f"Throughput             : {throughput_mileage_dask:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_mileage_dask / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_mileage_dask / 1e6:.2f} MB")
print("=================================================")

✅ 'Mileage' column cleaned successfully



     Mileage
0     5 - 10
1  115 - 120
2    20 - 25

Total rows processed   : 174150
Wall Time (Elapsed)    : 0.3569 seconds
CPU Time               : 203 ms
CPU Usage              : 10.50%
Throughput             : 487891.87 rows/sec
Current Memory (Python): 0.26 MB
Peak Memory (Python)   : 0.75 MB
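
The measurement boilerplate repeated around each cleaning step could be factored into one helper. The sketch below is an assumption about how that might look, not the notebook's actual code: the name `measure` and the stand-in workload are hypothetical, and it uses only the standard library (`time.process_time` instead of `psutil`), so it reports CPU time for the current Python process only.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, total_rows):
    """Collect wall time, process CPU time and Python-level memory for a block."""
    metrics = {"label": label, "rows": total_rows}
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process only
    try:
        yield metrics
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        metrics.update(
            wall_s=wall,
            cpu_ms=cpu * 1000,
            throughput=total_rows / wall if wall > 0 else 0.0,
            current_mb=current / 1e6,
            peak_mb=peak / 1e6,
        )


# Stand-in workload (hypothetical); a real cleaning step would go here.
data = [str(i) for i in range(100_000)]
with measure("demo", total_rows=len(data)) as m:
    cleaned = [s.strip() for s in data]

print(f"{m['label']}: {m['wall_s']:.4f} s, {m['cpu_ms']:.0f} ms CPU, "
      f"{m['throughput']:.0f} rows/sec, peak {m['peak_mb']:.2f} MB")
```

For a lazy library such as Dask, the block inside `with` would need to include the `.compute()` call, otherwise only graph construction is timed. Work done in separate worker processes would also not be captured by `time.process_time`.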


In [50]:
#Display the cleaned Dask DataFrame
df_dask.head()


Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage,Transmission,Color,Price,Installment,Condition,Location,Sales Channel,Seat Capacity
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10,Automatic,Black,375000,4862,Refurbished,"Selangor, Klang",Sales Agent,5
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120,Automatic,White,55999,726,Used,"Selangor, Kajang",Sales Agent,7
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25,Automatic,Grey,662222,8585,Refurbished,"Selangor, Klang",Sales Agent,4
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95,Automatic,Silver,67000,869,Used,"Johor, Ulu Tiram",Sales Agent,5
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85,Automatic,White,98999,1283,Used,"Selangor, Kajang",Sales Agent,5
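
The 5K-step binning that produces the `Mileage` ranges shown above can be exercised on its own. This is a minimal sketch using plain Python; the function name `bin_mileage`, the `step` parameter, and the sample inputs are illustrative assumptions rather than part of the notebook.

```python
def bin_mileage(mileage, step=5):
    """Map '7' -> '5 - 10'; keep explicit ranges such as '20-25' as '20 - 25'."""
    if not isinstance(mileage, str) or mileage.strip() == "":
        return None
    try:
        if "-" in mileage:
            # Existing range: only normalise the spacing around the dash.
            start, end = mileage.split("-")
            return f"{int(start.strip())} - {int(end.strip())}"
        # Single value: floor to a multiple of `step` for the lower bound.
        value = int(mileage.strip())
        lower = (value // step) * step
        return f"{lower} - {lower + step}"
    except ValueError:
        # Non-numeric text or a malformed range is treated as missing.
        return None


# Hypothetical inputs, for illustration only.
examples = ["7", "10", " 118 ", "20 - 25", "", "n/a", None]
results = [bin_mileage(x) for x in examples]
print(results)
```

A value that falls exactly on a boundary, such as `10`, is assigned to the bin that starts there (`10 - 15`), so each bin includes its lower bound and excludes its upper bound.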


### Replace Null Values

To handle missing values in the dataset, we first converted placeholder entries (such as empty or whitespace-only strings and "-") to nulls, then dropped every row with a null in `Body Type`, `Fuel Type`, `Mileage`, `Sales Channel`, or `Seat Capacity`.
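
Two common ways of handling missing categorical values are dropping the affected rows and filling them with the column mode. They can be compared on a small example. The following is a minimal pandas sketch on a hypothetical toy frame; the column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Body Type": ["SUV", "", "MPV", "SUV"],
    "Fuel Type": ["Petrol", "Diesel", "-", "Petrol"],
    "Price": [100, 200, 300, 400],
})

# Step 1: normalise placeholders (empty/whitespace strings and "-") to NaN.
toy = toy.replace(r"^\s*$", np.nan, regex=True).replace("-", np.nan)
null_counts = toy.isnull().sum()

# Option A: drop rows with a missing value in the chosen columns.
dropped = toy.dropna(subset=["Body Type", "Fuel Type"])

# Option B: fill each categorical column with its most frequent value (mode).
filled = toy.copy()
for col in ["Body Type", "Fuel Type"]:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(null_counts)
print(len(dropped), len(filled))
```

Dropping loses rows but keeps only observed values. Mode filling keeps every row but can inflate the most common category.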

**1. Pandas**

In [51]:
# # Empty string
# (df_pandas == '').sum()          

# # Whitespace only
# df_pandas.applymap(lambda x: str(x).isspace()).sum() 

# # -
# (df_pandas == '-').sum()

# Convert to np.nan
df_pandas.replace(r'^\s*$', np.nan, regex=True, inplace=True)  # Empty and whitespace-only strings
df_pandas.replace('-', np.nan, inplace=True)  # "-" 

# Check null
df_pandas.isnull().sum()


Car Name               0
Car Brand              0
Car Model              0
Manufacture Year       0
Body Type             32
Fuel Type             42
Mileage             5359
Transmission           0
Color                  0
Price                  0
Installment            0
Condition              0
Location               0
Sales Channel       4264
Seat Capacity         97
dtype: int64

In [52]:
# Setup
process_null_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_null_pandas = time.perf_counter()
start_cpu_time_null_pandas = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_null_pandas = psutil.cpu_percent(interval=None)

# ==========================================================================

# === Data Cleaning Step: Drop rows with missing values ===
total_rows_before_cleaning_null_pandas = df_pandas.shape[0]

# Drop rows with missing values in specific columns
df_pandas.dropna(subset=['Body Type', 'Fuel Type', 'Mileage', 'Sales Channel', 'Seat Capacity'], inplace=True)

# ==========================================================================

total_rows_after_cleaning_null_pandas = df_pandas.shape[0]

# Record end time, memory, and CPU usage
end_time_null_pandas = time.perf_counter()
end_cpu_time_null_pandas = psutil.cpu_times().user  # End CPU time
current_mem_null_pandas, peak_mem_null_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, measured after the operation rather than during it
cpu_usage_null_pandas = psutil.cpu_percent(interval=1)  # sampled over 1 second

# Calculate metrics
elapsed_time_null_pandas = end_time_null_pandas - start_time_null_pandas
cpu_time_null_pandas = (end_cpu_time_null_pandas - start_cpu_time_null_pandas) * 1000  # CPU time in ms
throughput_null_pandas = total_rows_before_cleaning_null_pandas / elapsed_time_null_pandas if elapsed_time_null_pandas > 0 else 0

# Output preview
print("✅ Missing values dropped successfully")
print(df_pandas.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows before cleaning   : {total_rows_before_cleaning_null_pandas}")
print(f"Total rows after cleaning    : {total_rows_after_cleaning_null_pandas}")
print(f"Wall Time (Elapsed)    : {elapsed_time_null_pandas:.4f} seconds")  # Time
print(f"CPU Time               : {cpu_time_null_pandas:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_null_pandas:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_null_pandas:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_null_pandas / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_null_pandas / 1e6:.2f} MB")
print("=================================================")


✅ Missing values dropped successfully
                                            Car Name Car Brand Car Model  \
0                   2023 Lexus RX350 2.4 F Sport SUV     Lexus     RX350   
1  2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...    Toyota    Estima   
2  2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...   Porsche   Cayenne   

   Manufacture Year Body Type                Fuel Type    Mileage  \
0              2023       SUV  Petrol - Unleaded (ULP)     5 - 10   
1              2010       MPV  Petrol - Unleaded (ULP)  115 - 120   
2              2020     Coupe  Petrol - Unleaded (ULP)    20 - 25   

  Transmission  Color   Price  Installment    Condition           Location  \
0    Automatic  Black  375000         4862  Refurbished   Selangor, Klang    
1    Automatic  White   55999          726         Used  Selangor, Kajang    
2    Automatic   Grey  662222         8585  Refurbished   Selangor, Klang    

  Sales Channel Seat Capacity  
0   Sales Agent             5  
1  

In [53]:
df_pandas.isnull().sum()

Car Name            0
Car Brand           0
Car Model           0
Manufacture Year    0
Body Type           0
Fuel Type           0
Mileage             0
Transmission        0
Color               0
Price               0
Installment         0
Condition           0
Location            0
Sales Channel       0
Seat Capacity       0
dtype: int64

**2. Polars**

In [54]:
string_cols = [col for col, dtype in zip(df_polars.columns, df_polars.dtypes) if dtype == pl.Utf8]

df_polars = df_polars.with_columns([
    pl.when(
        pl.col(col).str.strip_chars().is_in(["", "-", "NaN"])
    )
    .then(None)
    .otherwise(pl.col(col))
    .alias(col)
    for col in string_cols
])

# Show null counts in vertical format 
for col in df_polars.columns:
    null_counts_before = df_polars.select(pl.col(col).null_count()).item()
    print(f"{col:<25} : {null_counts_before}")

Car Name                  : 0
Car Brand                 : 0
Car Model                 : 0
Manufacture Year          : 0
Body Type                 : 32
Fuel Type                 : 42
Mileage                   : 5359
Transmission              : 0
Color                     : 0
Price                     : 0
Installment               : 0
Condition                 : 0
Location                  : 0
Sales Channel             : 4264
Seat Capacity             : 97


In [55]:
# Setup
process_null_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_null_polars = time.perf_counter()
start_cpu_time_null_polars = psutil.cpu_times().user
start_cpu_percent_null_polars = psutil.cpu_percent(interval=None)

# ==========================================================================

# === Data Cleaning Step: Drop rows with missing values ===
total_rows_before_cleaning_null_polars = df_polars.shape[0]

# Drop rows with missing values in specific columns
columns_to_check = ['Body Type', 'Fuel Type', 'Mileage', 'Sales Channel', 'Seat Capacity']
df_polars = df_polars.drop_nulls(subset=columns_to_check)

# ==========================================================================

total_rows_after_cleaning_null_polars = df_polars.shape[0]

# Record end time and memory usage
end_time_null_polars = time.perf_counter()
end_cpu_time_null_polars = psutil.cpu_times().user
current_mem_null_polars, peak_mem_null_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled over the 1 second after the operation
cpu_usage_null_polars = psutil.cpu_percent(interval=1)

# Calculate metrics
elapsed_time_null_polars = end_time_null_polars - start_time_null_polars
cpu_time_null_polars = (end_cpu_time_null_polars - start_cpu_time_null_polars) * 1000  # in ms
throughput_null_polars = total_rows_before_cleaning_null_polars / elapsed_time_null_polars if elapsed_time_null_polars > 0 else 0

# Output preview
print("✅ Missing values dropped successfully")
print(df_polars.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows before cleaning   : {total_rows_before_cleaning_null_polars}")
print(f"Total rows after cleaning    : {total_rows_after_cleaning_null_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_null_polars:.4f} seconds")
print(f"CPU Time               : {cpu_time_null_polars:.0f} ms")
print(f"CPU Usage              : {cpu_usage_null_polars:.2f}%")
print(f"Throughput             : {throughput_null_polars:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_null_polars / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_null_polars / 1e6:.2f} MB")
print("=================================================")


✅ Missing values dropped successfully
shape: (3, 15)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ Car Name  ┆ Car Brand ┆ Car Model ┆ Manufactu ┆ … ┆ Condition ┆ Location  ┆ Sales     ┆ Seat     │
│ ---       ┆ ---       ┆ ---       ┆ re Year   ┆   ┆ ---       ┆ ---       ┆ Channel   ┆ Capacity │
│ str       ┆ str       ┆ str       ┆ ---       ┆   ┆ str       ┆ str       ┆ ---       ┆ ---      │
│           ┆           ┆           ┆ i64       ┆   ┆           ┆           ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2023      ┆ Lexus     ┆ RX350     ┆ 2023      ┆ … ┆ Refurbish ┆ Selangor, ┆ Sales     ┆ 5        │
│ Lexus     ┆           ┆           ┆           ┆   ┆ ed        ┆ Klang     ┆ Agent     ┆          │
│ RX350 2.4 ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ F Sport   ┆           ┆           ┆ 

In [56]:
string_cols = [col for col, dtype in zip(df_polars.columns, df_polars.dtypes) if dtype == pl.Utf8]

df_polars = df_polars.with_columns([
    pl.when(
        pl.col(col).str.strip_chars().is_in(["", "-", "NaN"])
    )
    .then(None)
    .otherwise(pl.col(col))
    .alias(col)
    for col in string_cols
])

# Show null counts in vertical format 
for col in df_polars.columns:
    null_counts_after = df_polars.select(pl.col(col).null_count()).item()
    print(f"{col:<25} : {null_counts_after}")

Car Name                  : 0
Car Brand                 : 0
Car Model                 : 0
Manufacture Year          : 0
Body Type                 : 0
Fuel Type                 : 0
Mileage                   : 0
Transmission              : 0
Color                     : 0
Price                     : 0
Installment               : 0
Condition                 : 0
Location                  : 0
Sales Channel             : 0
Seat Capacity             : 0


**3. Modin**

In [57]:
# Convert to np.nan
df_modin.replace(r'^\s*$', np.nan, regex=True, inplace=True)  # Empty and whitespace-only strings
df_modin.replace('-', np.nan, inplace=True)  # "-" 

# Check null
df_modin.isnull().sum()

Car Name               0
Car Brand              0
Car Model              0
Manufacture Year       0
Body Type             32
Fuel Type             42
Mileage             5359
Transmission           0
Color                  0
Price                  0
Installment            0
Condition              0
Location               0
Sales Channel       4264
Seat Capacity         97
dtype: int64

In [58]:
# Setup
process_null_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_null_modin = time.perf_counter()
start_cpu_time_null_modin = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_null_modin = psutil.cpu_percent(interval=None)

# ==========================================================================

# === Data Cleaning Step: Drop rows with missing values ===
total_rows_before_null_modin = df_modin.shape[0]

# Drop rows with missing values in specific columns
df_modin.dropna(subset=['Body Type', 'Fuel Type', 'Mileage', 'Sales Channel', 'Seat Capacity'], inplace=True)

# ==========================================================================

total_rows_after_null_modin = df_modin.shape[0]

# Record end time, memory, and CPU usage
end_time_null_modin = time.perf_counter()
end_cpu_time_null_modin = psutil.cpu_times().user  # End CPU time
current_mem_null_modin, peak_mem_null_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, measured after the operation rather than during it
cpu_usage_null_modin = psutil.cpu_percent(interval=1)  # sampled over 1 second

# Calculate metrics
elapsed_time_null_modin = end_time_null_modin - start_time_null_modin
cpu_time_null_modin = (end_cpu_time_null_modin - start_cpu_time_null_modin) * 1000  # CPU time in ms
throughput_null_modin = total_rows_before_null_modin / elapsed_time_null_modin if elapsed_time_null_modin > 0 else 0

# Output preview
print("✅ Missing values dropped successfully")
print(df_modin.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows before cleaning   : {total_rows_before_null_modin}")
print(f"Total rows after cleaning    : {total_rows_after_null_modin}")
print(f"Wall Time (Elapsed)    : {elapsed_time_null_modin:.4f} seconds")  # Time
print(f"CPU Time               : {cpu_time_null_modin:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_null_modin:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_null_modin:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_null_modin / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_null_modin / 1e6:.2f} MB")
print("=================================================")

✅ Missing values dropped successfully
                                            Car Name Car Brand Car Model  \
0                   2023 Lexus RX350 2.4 F Sport SUV     Lexus     RX350   
1  2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...    Toyota    Estima   
2  2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...   Porsche   Cayenne   

   Manufacture Year Body Type                Fuel Type    Mileage  \
0              2023       SUV  Petrol - Unleaded (ULP)     5 - 10   
1              2010       MPV  Petrol - Unleaded (ULP)  115 - 120   
2              2020     Coupe  Petrol - Unleaded (ULP)    20 - 25   

  Transmission  Color   Price  Installment    Condition           Location  \
0    Automatic  Black  375000         4862  Refurbished   Selangor, Klang    
1    Automatic  White   55999          726         Used  Selangor, Kajang    
2    Automatic   Grey  662222         8585  Refurbished   Selangor, Klang    

  Sales Channel Seat Capacity  
0   Sales Agent             5  
1  

In [59]:
df_modin.isnull().sum()

Car Name            0
Car Brand           0
Car Model           0
Manufacture Year    0
Body Type           0
Fuel Type           0
Mileage             0
Transmission        0
Color               0
Price               0
Installment         0
Condition           0
Location            0
Sales Channel       0
Seat Capacity       0
dtype: int64

**4. Dask**

In [60]:
# Convert to np.nan
df_dask = df_dask.replace(r'^\s*$', np.nan, regex=True)  # Empty and whitespace
df_dask = df_dask.replace('-', np.nan) # -

# Check for null values in each column
df_dask.isnull().sum().compute()


Car Name               0
Car Brand              0
Car Model              0
Manufacture Year       0
Body Type             32
Fuel Type             42
Mileage             5359
Transmission           0
Color                  0
Price                  0
Installment            0
Condition              0
Location               0
Sales Channel       4264
Seat Capacity         97
dtype: int64

In [61]:
# Setup
process_null_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_null_dask = time.perf_counter()
start_cpu_time_null_dask = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_null_dask = psutil.cpu_percent(interval=None)

# ========== Drop missing rows ==========
# Get total rows before cleaning
total_rows_before_null_dask = df_dask.shape[0].compute()

# Drop rows with missing values in specific columns
df_dask = df_dask.dropna(subset=['Body Type', 'Fuel Type', 'Mileage', 'Sales Channel', 'Seat Capacity'])

# ==========================================================================

# # Check the null values again to confirm replacements
# null_counts_after_null_dask = df_dask.isnull().sum().compute()
# print("\nNull values after replacement:")
# for column, count in null_counts_after_null_dask.items():
#     print(f"{column:<20} {count}")
# print("dtype: int64")

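# Calculate total rows after the drop; this .compute() triggers the lazy dropna
# (and any earlier steps still pending in the task graph), so that work is included in the timing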
total_rows_after_null_dask = df_dask.shape[0].compute()

# Record end time, memory, and CPU usage
end_time_null_dask = time.perf_counter()
end_cpu_time_null_dask = psutil.cpu_times().user  # End CPU time
current_mem_null_dask, peak_mem_null_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, measured after the operation rather than during it
cpu_usage_null_dask = psutil.cpu_percent(interval=1)  # sampled over 1 second

# Calculate metrics
elapsed_time_null_dask = end_time_null_dask - start_time_null_dask
cpu_time_null_dask = (end_cpu_time_null_dask - start_cpu_time_null_dask) * 1000  # CPU time in ms
throughput_null_dask = total_rows_before_null_dask / elapsed_time_null_dask if elapsed_time_null_dask > 0 else 0

# Output preview
print("✅ Missing values dropped successfully")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows before drop   : {total_rows_before_null_dask}")
print(f"Total rows after drop    : {total_rows_after_null_dask}")
print(f"Wall Time (Elapsed)   : {elapsed_time_null_dask:.4f} seconds")
print(f"CPU Time              : {cpu_time_null_dask:.0f} ms")
print(f"CPU Usage             : {cpu_usage_null_dask:.2f}%") 
print(f"Throughput            : {throughput_null_dask:.2f} rows/sec")
print(f"Current Memory (Python): {current_mem_null_dask / 1e6:.2f} MB")
print(f"Peak Memory (Python)   : {peak_mem_null_dask / 1e6:.2f} MB")  
print("=================================================")



✅ Missing values dropped successfully

Total rows before drop   : 174150
Total rows after drop    : 164516
Wall Time (Elapsed)   : 21.1509 seconds
CPU Time              : 40359 ms
CPU Usage             : 19.90%
Throughput            : 8233.69 rows/sec
Current Memory (Python): 5.05 MB
Peak Memory (Python)   : 344.34 MB


In [62]:
# Check for null values in each column
df_dask.isnull().sum().compute()


Car Name            0
Car Brand           0
Car Model           0
Manufacture Year    0
Body Type           0
Fuel Type           0
Mileage             0
Transmission        0
Color               0
Price               0
Installment         0
Condition           0
Location            0
Sales Channel       0
Seat Capacity       0
dtype: int64

### Rename Columns

We renamed selected columns to include their units (K KM for mileage, RM for price and installment) so that the values are easier to interpret and present.
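
Because the same three columns are renamed in every library, the mapping could be defined once and reused. The sketch below shows this for pandas only, on a hypothetical single-row frame. The constant name `RENAME_MAP` is an assumption, and `errors="raise"` is an optional safeguard that the cells below do not use.

```python
import pandas as pd

# One mapping defined once keeps the new names consistent across cells.
RENAME_MAP = {
    "Mileage": "Mileage (K KM)",
    "Price": "Price (RM)",
    "Installment": "Installment (RM)",
}

# Hypothetical single-row frame, for illustration only.
toy = pd.DataFrame({"Mileage": ["5 - 10"], "Price": [375000], "Installment": [4862]})

# errors="raise" makes pandas fail loudly if a key is not an existing column,
# instead of silently skipping it (the default is errors="ignore").
renamed = toy.rename(columns=RENAME_MAP, errors="raise")
print(list(renamed.columns))
```

Renaming only touches column labels, so its cost does not depend on the number of rows. The rows/sec figures reported for this step are therefore best read as an overhead indicator rather than a processing rate.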

**1. Pandas**

In [63]:
# Setup
process_rename_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_rename_pandas = time.time()
start_cpu_time_rename_pandas = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_rename_pandas = psutil.cpu_percent(interval=None)  # Initial CPU usage

# === Column Renaming ===
total_rows_rename_pandas = df_pandas.shape[0]

# ======================

df_pandas.rename(columns={
    'Mileage': 'Mileage (K KM)',
    'Price': 'Price (RM)',
    'Installment': 'Installment (RM)'
}, inplace=True)

# ======================

# Record end time, memory, and CPU usage
end_time_rename_pandas = time.time()
end_cpu_time_rename_pandas = psutil.cpu_times().user  # End CPU time
current_mem_rename_pandas, peak_mem_rename_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, measured after the operation rather than during it
cpu_usage_rename_pandas = psutil.cpu_percent(interval=1)  # sampled over 1 second

# Calculate metrics
elapsed_time_rename_pandas = end_time_rename_pandas - start_time_rename_pandas  # Wall time in seconds
cpu_time_rename_pandas = (end_cpu_time_rename_pandas - start_cpu_time_rename_pandas) * 1000  # CPU time in ms
throughput_rename_pandas = total_rows_rename_pandas / elapsed_time_rename_pandas if elapsed_time_rename_pandas > 0 else 0

# Output preview
print("✅ Columns renamed successfully")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_rename_pandas}")
print(f"Wall Time (Elapsed)    : {elapsed_time_rename_pandas:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_rename_pandas:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_rename_pandas:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_rename_pandas:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_rename_pandas / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_rename_pandas / 1e6:.2f} MB")
print("=================================================")


✅ Columns renamed successfully

Total rows processed   : 164516
Wall Time (Elapsed)    : 0.0050 seconds
CPU Time               : 156 ms
CPU Usage             : 12.80%
Throughput            : 33195271.89 rows/sec
Current Memory (Python): 0.00 MB
Peak Memory (Python)   : 0.02 MB


In [64]:
# Display dataframe
df_pandas.head()

Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage (K KM),Transmission,Color,Price (RM),Installment (RM),Condition,Location,Sales Channel,Seat Capacity
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10,Automatic,Black,375000,4862,Refurbished,"Selangor, Klang",Sales Agent,5
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120,Automatic,White,55999,726,Used,"Selangor, Kajang",Sales Agent,7
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25,Automatic,Grey,662222,8585,Refurbished,"Selangor, Klang",Sales Agent,4
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95,Automatic,Silver,67000,869,Used,"Johor, Ulu Tiram",Sales Agent,5
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85,Automatic,White,98999,1283,Used,"Selangor, Kajang",Sales Agent,5


**2. Polars**

In [65]:
# Setup
process_rename_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_rename_polars = time.time()
start_cpu_time_rename_polars = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_rename_polars = psutil.cpu_percent(interval=None)  # Initial CPU usage

# === Column Renaming ===
total_rows_rename_polars = df_polars.height 

# ======================
df_polars = df_polars.rename({
    'Mileage': 'Mileage (K KM)',
    'Price': 'Price (RM)',
    'Installment': 'Installment (RM)'
})

# ======================

# Record end time, memory, and CPU usage
end_time_rename_polars = time.time()
end_cpu_time_rename_polars = psutil.cpu_times().user  # End CPU time
current_mem_rename_polars, peak_mem_rename_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, measured after the operation rather than during it
cpu_usage_rename_polars = psutil.cpu_percent(interval=1)  # sampled over 1 second

# Calculate metrics
elapsed_time_rename_polars = end_time_rename_polars - start_time_rename_polars  # Wall time in seconds
cpu_time_rename_polars = (end_cpu_time_rename_polars - start_cpu_time_rename_polars) * 1000  # CPU time in ms
throughput_rename_polars = total_rows_rename_polars / elapsed_time_rename_polars if elapsed_time_rename_polars > 0 else 0

# Output preview
print("✅ Columns renamed successfully")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_rename_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_rename_polars:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_rename_polars:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_rename_polars:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_rename_polars:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_rename_polars / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_rename_polars / 1e6:.2f} MB")
print("=================================================")


✅ Columns renamed successfully

Total rows processed   : 164516
Wall Time (Elapsed)    : 0.0046 seconds
CPU Time               : 0 ms
CPU Usage             : 12.50%
Throughput            : 36049846.76 rows/sec
Current Memory (Python): 0.01 MB
Peak Memory (Python)   : 0.03 MB


In [66]:
df_polars.head()

Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage (K KM),Transmission,Color,Price (RM),Installment (RM),Condition,Location,Sales Channel,Seat Capacity
str,str,str,i64,str,str,str,str,str,i64,i64,str,str,str,str
"""2023 Lexus RX350 2.4 F Sport S…","""Lexus""","""RX350""",2023,"""SUV""","""Petrol - Unleaded (ULP)""","""5 - 10""","""Automatic""","""Black""",375000,4862,"""Refurbished""","""Selangor, Klang ""","""Sales Agent""","""5"""
"""2010 Toyota Estima 2.4 Aeras M…","""Toyota""","""Estima""",2010,"""MPV""","""Petrol - Unleaded (ULP)""","""115 - 120""","""Automatic""","""White""",55999,726,"""Used""","""Selangor, Kajang ""","""Sales Agent""","""7"""
"""2020 Porsche Cayenne Coupe 4.0…","""Porsche""","""Cayenne""",2020,"""Coupe""","""Petrol - Unleaded (ULP)""","""20 - 25""","""Automatic""","""Grey""",662222,8585,"""Refurbished""","""Selangor, Klang ""","""Sales Agent""","""4"""
"""2021 Honda City 1.5 V i-VTEC H…","""Honda""","""City""",2021,"""Hatchback""","""Petrol - Unleaded (ULP)""","""90 - 95""","""Automatic""","""Silver""",67000,869,"""Used""","""Johor, Ulu Tiram ""","""Sales Agent""","""5"""
"""2022 Toyota Corolla Cross 1.8 …","""Toyota""","""Corolla Cross""",2022,"""SUV""","""Petrol - Unleaded (ULP)""","""80 - 85""","""Automatic""","""White""",98999,1283,"""Used""","""Selangor, Kajang ""","""Sales Agent""","""5"""


**3. Modin**

In [67]:
# Setup
process_rename_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_rename_modin = time.time()
start_cpu_time_rename_modin = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_rename_modin = psutil.cpu_percent(interval=None)  # Initial CPU usage

# === Column Renaming ===
total_rows_rename_modin = df_modin.shape[0]

# ======================

df_modin.rename(columns={
    'Mileage': 'Mileage (K KM)',
    'Price': 'Price (RM)',
    'Installment': 'Installment (RM)'
}, inplace=True)

# ======================

# Record end time, memory, and CPU usage
end_time_rename_modin = time.time()
end_cpu_time_rename_modin = psutil.cpu_times().user  # End CPU time
current_mem_rename_modin, peak_mem_rename_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled for 1 second after the operation has finished
cpu_usage_rename_modin = psutil.cpu_percent(interval=1)  # blocking 1-second sample

# Calculate metrics
elapsed_time_rename_modin = end_time_rename_modin - start_time_rename_modin  # Wall time in seconds
cpu_time_rename_modin = (end_cpu_time_rename_modin - start_cpu_time_rename_modin) * 1000  # CPU time in ms
throughput_rename_modin = total_rows_rename_modin / elapsed_time_rename_modin if elapsed_time_rename_modin > 0 else 0

# Output preview
print("✅ Columns renamed successfully")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_rename_modin}")
print(f"Wall Time (Elapsed)    : {elapsed_time_rename_modin:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_rename_modin:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_rename_modin:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_rename_modin:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_rename_modin / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_rename_modin / 1e6:.2f} MB")
print("=================================================")


✅ Columns renamed successfully

Total rows processed   : 164516
Wall Time (Elapsed)    : 0.0045 seconds
CPU Time               : 0 ms
CPU Usage             : 18.70%
Throughput            : 36660828.65 rows/sec
Current Memory (Python): 0.02 MB
Peak Memory (Python)   : 0.04 MB
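
The timing cells in this notebook repeat the same setup and teardown for every library. As a rough sketch of how that pattern could be factored out, the helper below wraps a callable and collects wall time, CPU time, and traced memory using only the standard library. The name `measure` and its `rows` parameter are hypothetical and do not appear in the original notebook. The sketch also swaps `psutil.cpu_times().user`, which is system-wide, for `time.process_time()`, which covers only the current Python process, so work done in separate Modin or Dask worker processes would not be counted.

```python
import time
import tracemalloc


def measure(func, *args, rows=None, **kwargs):
    """Run func once and return (result, metrics) for the current process."""
    tracemalloc.start()
    wall_start = time.perf_counter()  # monotonic wall clock
    cpu_start = time.process_time()   # user + system CPU time of this process only

    result = func(*args, **kwargs)

    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    current, peak = tracemalloc.get_traced_memory()  # Python-level allocations only
    tracemalloc.stop()

    metrics = {
        "wall_s": wall,
        "cpu_ms": cpu * 1000,
        "current_mb": current / 1e6,
        "peak_mb": peak / 1e6,
        "rows_per_s": rows / wall if rows and wall > 0 else None,
    }
    return result, metrics


# Hypothetical workload standing in for a DataFrame operation
result, metrics = measure(lambda n: sum(range(n)), 1_000_000, rows=1_000_000)
for name, value in metrics.items():
    print(f"{name:<12}: {value:.4f}")
```

Very short operations such as the rename above finish in a few milliseconds, so a single run gives a noisy throughput figure. Repeating the call and averaging would be one way to steady it.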


In [68]:
df_modin.head()

Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage (K KM),Transmission,Color,Price (RM),Installment (RM),Condition,Location,Sales Channel,Seat Capacity
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10,Automatic,Black,375000,4862,Refurbished,"Selangor, Klang",Sales Agent,5
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120,Automatic,White,55999,726,Used,"Selangor, Kajang",Sales Agent,7
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25,Automatic,Grey,662222,8585,Refurbished,"Selangor, Klang",Sales Agent,4
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95,Automatic,Silver,67000,869,Used,"Johor, Ulu Tiram",Sales Agent,5
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85,Automatic,White,98999,1283,Used,"Selangor, Kajang",Sales Agent,5


**4. Dask**

In [69]:
# Setup
process_rename_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_rename_dask = time.time()
start_cpu_time_rename_dask = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_rename_dask = psutil.cpu_percent(interval=None)  # Initial CPU usage

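# === Column Renaming ===
# Counting rows triggers a full Dask compute inside the timed region,
# so it likely accounts for most of the wall time reported below
total_rows_rename_dask = df_dask.shape[0].compute()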

# Rename columns in Dask DataFrame
df_dask = df_dask.rename(columns={
    'Mileage': 'Mileage (K KM)',
    'Price': 'Price (RM)',
    'Installment': 'Installment (RM)'
})

# Record end time, memory, and CPU usage
end_time_rename_dask = time.time()
end_cpu_time_rename_dask = psutil.cpu_times().user  # End CPU time
current_mem_rename_dask, peak_mem_rename_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled for 1 second after the operation has finished
cpu_usage_rename_dask = psutil.cpu_percent(interval=1)  # blocking 1-second sample

# Calculate metrics
elapsed_time_rename_dask = end_time_rename_dask - start_time_rename_dask  # Wall time in seconds
cpu_time_rename_dask = (end_cpu_time_rename_dask - start_cpu_time_rename_dask) * 1000  # CPU time in ms
throughput_rename_dask = total_rows_rename_dask / elapsed_time_rename_dask if elapsed_time_rename_dask > 0 else 0

# Output preview
print("✅ Columns renamed successfully")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {total_rows_rename_dask}")
print(f"Wall Time (Elapsed)    : {elapsed_time_rename_dask:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_rename_dask:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_rename_dask:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_rename_dask:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_rename_dask / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_rename_dask / 1e6:.2f} MB")
print("=================================================")


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


✅ Columns renamed successfully

Total rows processed   : 164516
Wall Time (Elapsed)    : 20.1589 seconds
CPU Time               : 33641 ms
CPU Usage             : 12.00%
Throughput            : 8160.95 rows/sec
Current Memory (Python): 4.83 MB
Peak Memory (Python)   : 344.34 MB


In [70]:
#Display dataframe
df_dask.head()


Unnamed: 0,Car Name,Car Brand,Car Model,Manufacture Year,Body Type,Fuel Type,Mileage (K KM),Transmission,Color,Price (RM),Installment (RM),Condition,Location,Sales Channel,Seat Capacity
0,2023 Lexus RX350 2.4 F Sport SUV,Lexus,RX350,2023,SUV,Petrol - Unleaded (ULP),5 - 10,Automatic,Black,375000,4862,Refurbished,"Selangor, Klang",Sales Agent,5
1,2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...,Toyota,Estima,2010,MPV,Petrol - Unleaded (ULP),115 - 120,Automatic,White,55999,726,Used,"Selangor, Kajang",Sales Agent,7
2,2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...,Porsche,Cayenne,2020,Coupe,Petrol - Unleaded (ULP),20 - 25,Automatic,Grey,662222,8585,Refurbished,"Selangor, Klang",Sales Agent,4
3,2021 Honda City 1.5 V i-VTEC Hatchback,Honda,City,2021,Hatchback,Petrol - Unleaded (ULP),90 - 95,Automatic,Silver,67000,869,Used,"Johor, Ulu Tiram",Sales Agent,5
4,2022 Toyota Corolla Cross 1.8 V SUV Full Servi...,Toyota,Corolla Cross,2022,SUV,Petrol - Unleaded (ULP),80 - 85,Automatic,White,98999,1283,Used,"Selangor, Kajang",Sales Agent,5


### Drop Duplicated Rows

**1. Pandas**

In [71]:
# Check for duplicate rows
duplicates_before_pandas = df_pandas.duplicated().sum()
print(f"Number of duplicate rows: {duplicates_before_pandas}")

Number of duplicate rows: 3722


In [72]:
# Drop duplicate rows
# Setup
process_drop_duplicates_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_drop_duplicates_pandas = time.time()
start_cpu_time_drop_duplicates_pandas = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_drop_duplicates_pandas = psutil.cpu_percent(interval=None)  # Initial CPU usage

# === Drop Duplicate Rows ===
df_pandas = df_pandas.drop_duplicates()
# ===========================

# Record end time, memory, and CPU usage
end_time_drop_duplicates_pandas = time.time()
end_cpu_time_drop_duplicates_pandas = psutil.cpu_times().user  # End CPU time
current_mem_drop_duplicates_pandas, peak_mem_drop_duplicates_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled for 1 second after the operation has finished
cpu_usage_drop_duplicates_pandas = psutil.cpu_percent(interval=1)  # blocking 1-second sample

# Calculate metrics
elapsed_time_drop_duplicates_pandas = end_time_drop_duplicates_pandas - start_time_drop_duplicates_pandas  # Wall time in seconds
cpu_time_drop_duplicates_pandas = (end_cpu_time_drop_duplicates_pandas - start_cpu_time_drop_duplicates_pandas) * 1000  # CPU time in ms
throughput_drop_duplicates_pandas = df_pandas.shape[0] / elapsed_time_drop_duplicates_pandas if elapsed_time_drop_duplicates_pandas > 0 else 0

# Output preview
print("✅ Duplicate rows dropped")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows after dropping   : {df_pandas.shape[0]}")
print(f"Wall Time (Elapsed)    : {elapsed_time_drop_duplicates_pandas:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_drop_duplicates_pandas:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_drop_duplicates_pandas:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_drop_duplicates_pandas:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_drop_duplicates_pandas / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_drop_duplicates_pandas / 1e6:.2f} MB")
print("=================================================")


✅ Duplicate rows dropped

Total rows after dropping   : 160794
Wall Time (Elapsed)    : 0.4250 seconds
CPU Time               : 406 ms
CPU Usage             : 11.10%
Throughput            : 378294.51 rows/sec
Current Memory (Python): 20.65 MB
Peak Memory (Python)   : 30.84 MB


In [73]:
# Check for duplicate rows after dropping
duplicates_after_pandas = df_pandas.duplicated().sum()
print(f"Number of duplicate rows after dropping: {duplicates_after_pandas}")

Number of duplicate rows after dropping: 0


**2. Polars**

In [74]:
# Setup
process_drop_duplicates_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_drop_duplicates_polars = time.time()
start_cpu_time_drop_duplicates_polars = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_drop_duplicates_polars = psutil.cpu_percent(interval=None)  # Initial CPU usage

# Get initial number of rows before dropping duplicates
initial_row_count = df_polars.height

# === Drop Duplicate Rows in Polars ===
df_polars = df_polars.unique()  # Drops duplicate rows
# =========================

# Get final number of rows after dropping duplicates
final_row_count = df_polars.height
rows_dropped = initial_row_count - final_row_count  # Calculate rows dropped

# Record end time, memory, and CPU usage
end_time_drop_duplicates_polars = time.time()
end_cpu_time_drop_duplicates_polars = psutil.cpu_times().user  # End CPU time
current_mem_drop_duplicates_polars, peak_mem_drop_duplicates_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled for 1 second after the operation has finished
cpu_usage_drop_duplicates_polars = psutil.cpu_percent(interval=1)  # blocking 1-second sample

# Calculate metrics
elapsed_time_drop_duplicates_polars = end_time_drop_duplicates_polars - start_time_drop_duplicates_polars  # Wall time in seconds
cpu_time_drop_duplicates_polars = (end_cpu_time_drop_duplicates_polars - start_cpu_time_drop_duplicates_polars) * 1000  # CPU time in ms
throughput_drop_duplicates_polars = df_polars.height / elapsed_time_drop_duplicates_polars if elapsed_time_drop_duplicates_polars > 0 else 0

# Output preview
print(f"✅ {rows_dropped} duplicate rows dropped")  # Show how many rows were dropped

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows after dropping   : {df_polars.height}")
print(f"Wall Time (Elapsed)    : {elapsed_time_drop_duplicates_polars:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_drop_duplicates_polars:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_drop_duplicates_polars:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_drop_duplicates_polars:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_drop_duplicates_polars / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_drop_duplicates_polars / 1e6:.2f} MB")
print("=================================================")


✅ 3722 duplicate rows dropped

Total rows after dropping   : 160794
Wall Time (Elapsed)    : 0.0435 seconds
CPU Time               : 141 ms
CPU Usage             : 15.80%
Throughput            : 3696134.72 rows/sec
Current Memory (Python): 0.01 MB
Peak Memory (Python)   : 0.03 MB


In [75]:
# Check for duplicate rows in Polars
duplicates_after_polars = df_polars.is_duplicated().sum()
print(f"Number of duplicate rows after dropping: {duplicates_after_polars}")


Number of duplicate rows after dropping: 0
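
The pandas and Modin checks use `duplicated()`, while the Polars check uses `is_duplicated()`, and the two do not count the same thing. The sketch below models both in plain Python on made-up rows: by default, pandas flags every repeat except the first occurrence, whereas Polars flags every row that belongs to a repeated group. Both report zero once duplicates are gone, so the "after" checks here remain valid, but "before" counts taken with the two methods would generally differ.

```python
from collections import Counter

# Made-up rows for illustration only
rows = [
    ("Toyota", "Vios", 2020),
    ("Honda", "City", 2021),
    ("Toyota", "Vios", 2020),  # repeat of row 0
    ("Toyota", "Vios", 2020),  # repeat of row 0
]

# pandas-style duplicated() with keep="first": flag repeats, not the first occurrence
seen = set()
flag_after_first = []
for row in rows:
    flag_after_first.append(row in seen)
    seen.add(row)

# Polars-style is_duplicated(): flag every row that occurs more than once
counts = Counter(rows)
flag_all = [counts[row] > 1 for row in rows]

print(sum(flag_after_first))  # 2 -> rows that dropping duplicates would remove
print(sum(flag_all))          # 3 -> rows that belong to a duplicated group
```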


**3. Modin**

In [76]:
# Check for duplicate rows
duplicates_before_modin = df_modin.duplicated().sum()
print(f"Number of duplicate rows: {duplicates_before_modin}")

Number of duplicate rows: 3722


In [77]:
# Drop duplicate rows for Modin
# Setup
process_drop_duplicates_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_drop_duplicates_modin = time.time()
start_cpu_time_drop_duplicates_modin = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_drop_duplicates_modin = psutil.cpu_percent(interval=None)  # Initial CPU usage

# === Drop Duplicate Rows ===
df_modin = df_modin.drop_duplicates()
# ===========================

# Record end time, memory, and CPU usage
end_time_drop_duplicates_modin = time.time()
end_cpu_time_drop_duplicates_modin = psutil.cpu_times().user  # End CPU time
current_mem_drop_duplicates_modin, peak_mem_drop_duplicates_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled for 1 second after the operation has finished
cpu_usage_drop_duplicates_modin = psutil.cpu_percent(interval=1)  # blocking 1-second sample

# Calculate metrics
elapsed_time_drop_duplicates_modin = end_time_drop_duplicates_modin - start_time_drop_duplicates_modin  # Wall time in seconds
cpu_time_drop_duplicates_modin = (end_cpu_time_drop_duplicates_modin - start_cpu_time_drop_duplicates_modin) * 1000  # CPU time in ms
throughput_drop_duplicates_modin = df_modin.shape[0] / elapsed_time_drop_duplicates_modin if elapsed_time_drop_duplicates_modin > 0 else 0

# Output preview
print("✅ Duplicate rows dropped")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows after dropping  : {df_modin.shape[0]}")
print(f"Wall Time (Elapsed)    : {elapsed_time_drop_duplicates_modin:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_drop_duplicates_modin:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_drop_duplicates_modin:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_drop_duplicates_modin:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_drop_duplicates_modin / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_drop_duplicates_modin / 1e6:.2f} MB")
print("=================================================")


✅ Duplicate rows dropped

Total rows after dropping  : 160794
Wall Time (Elapsed)    : 6.2243 seconds
CPU Time               : 9531 ms
CPU Usage             : 44.70%
Throughput            : 25833.47 rows/sec
Current Memory (Python): 8.46 MB
Peak Memory (Python)   : 8.97 MB


In [78]:
# Check for duplicate rows after dropping
duplicates_after_modin = df_modin.duplicated().sum()
print(f"Number of duplicate rows after dropping: {duplicates_after_modin}")

Number of duplicate rows after dropping: 0


**4. Dask**

In [79]:
# Compute the Dask DataFrame into a pandas DataFrame once, then count duplicates on it

print("Computing Dask DataFrame to pandas...")
df_pandas_duplicates_before = df_dask.compute()

# Use pandas' duplicated method
duplicates_before_dask = df_pandas_duplicates_before.duplicated().sum()
print(f"Number of duplicate rows: {duplicates_before_dask}")

Computing Dask DataFrame to pandas...



Number of duplicate rows: 3722


In [80]:
# Drop duplicate rows for Dask
# Setup
process_drop_duplicates_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_drop_duplicates_dask = time.time()
start_cpu_time_drop_duplicates_dask = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_drop_duplicates_dask = psutil.cpu_percent(interval=None)  # Initial CPU usage

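# === Drop Duplicate Rows ===
# Lazy: this only adds drop_duplicates to the task graph; the rows are
# actually dropped when .compute() is called after the timer has stopped
df_dask = df_dask.drop_duplicates()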

# Record end time, memory, and CPU usage
end_time_drop_duplicates_dask = time.time()
end_cpu_time_drop_duplicates_dask = psutil.cpu_times().user  # End CPU time
current_mem_drop_duplicates_dask, peak_mem_drop_duplicates_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled for 1 second after the operation has finished
cpu_usage_drop_duplicates_dask = psutil.cpu_percent(interval=1)  # blocking 1-second sample

# Calculate metrics
elapsed_time_drop_duplicates_dask = end_time_drop_duplicates_dask - start_time_drop_duplicates_dask  # Wall time in seconds
cpu_time_drop_duplicates_dask = (end_cpu_time_drop_duplicates_dask - start_cpu_time_drop_duplicates_dask) * 1000  # CPU time in ms
throughput_drop_duplicates_dask = df_dask.shape[0].compute() / elapsed_time_drop_duplicates_dask if elapsed_time_drop_duplicates_dask > 0 else 0

# Output preview
print("✅ Duplicate rows dropped")

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows after dropping  : {df_dask.shape[0].compute()}")
print(f"Wall Time (Elapsed)    : {elapsed_time_drop_duplicates_dask:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_drop_duplicates_dask:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_drop_duplicates_dask:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_drop_duplicates_dask:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_drop_duplicates_dask / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_drop_duplicates_dask / 1e6:.2f} MB")
print("=================================================")



✅ Duplicate rows dropped




Total rows after dropping  : 160794
Wall Time (Elapsed)    : 0.0668 seconds
CPU Time               : 344 ms
CPU Usage             : 11.50%
Throughput            : 2407074.39 rows/sec
Current Memory (Python): 0.09 MB
Peak Memory (Python)   : 0.33 MB


In [81]:
df_pandas_duplicates_after = df_dask.compute()
duplicates_after_dask = df_pandas_duplicates_after.duplicated().sum()
print(f"Number of duplicate rows after dropping: {duplicates_after_dask}")


Number of duplicate rows after dropping: 0


### Optimize Memory Usage by Using Efficient Data Types

By default, the numeric columns are loaded as 64-bit integers and the text columns as generic strings (`object` in pandas), which is not memory-efficient. To reduce memory consumption and improve performance, we downcast the numeric columns to smaller integer types and convert repetitive text columns, such as brand, body type, and color, to a categorical type.
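
As a minimal sketch of how a target integer width can be chosen, the helper below returns the narrowest signed integer type that holds a given value range. The function name `smallest_signed_int` is hypothetical, and the example ranges are assumptions for illustration rather than values measured from the dataset.

```python
# Signed integer widths offered by pandas and Polars, smallest first
INT_BITS = (8, 16, 32, 64)


def smallest_signed_int(lo, hi):
    """Return the narrowest signed integer dtype name that can hold [lo, hi]."""
    for bits in INT_BITS:
        if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return f"int{bits}"
    raise OverflowError("value range does not fit in 64 bits")


# Assumed ranges, for illustration only
print(smallest_signed_int(1900, 2100))    # int16 (manufacture year)
print(smallest_signed_int(0, 2_000_000))  # int32 (price in RM)
print(smallest_signed_int(0, 20))         # int8  (seat capacity)
```

Checking the actual minimum and maximum of each column before casting guards against silent overflow when a value falls outside the chosen type.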

**1. Pandas**

In [82]:
# Check the initial info about the dataset
df_pandas.info()


<class 'pandas.core.frame.DataFrame'>
Index: 160794 entries, 0 to 174149
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Car Name          160794 non-null  object
 1   Car Brand         160794 non-null  object
 2   Car Model         160794 non-null  object
 3   Manufacture Year  160794 non-null  int64 
 4   Body Type         160794 non-null  object
 5   Fuel Type         160794 non-null  object
 6   Mileage (K KM)    160794 non-null  object
 7   Transmission      160794 non-null  object
 8   Color             160794 non-null  object
 9   Price (RM)        160794 non-null  int64 
 10  Installment (RM)  160794 non-null  int64 
 11  Condition         160794 non-null  object
 12  Location          160794 non-null  object
 13  Sales Channel     160794 non-null  object
 14  Seat Capacity     160794 non-null  object
dtypes: int64(3), object(12)
memory usage: 19.6+ MB


In [83]:
# Check the initial memory usage for pandas dataset
start_mem_check_pandas = df_pandas.memory_usage().sum() / 1024**2  # Convert bytes to megabytes
print(f"Initial memory usage: {start_mem_check_pandas:.2f} MB")

Initial memory usage: 19.63 MB
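
The `+` in the `19.6+ MB` reported by `info()` indicates a lower bound: for `object` columns, `memory_usage()` counts only the 8-byte references, not the Python strings they point to. The sketch below, on a small made-up frame rather than `df_pandas`, shows how `memory_usage(deep=True)` includes the string payload and how a categorical column compares on that basis.

```python
import pandas as pd

# Small stand-in frame; the values are made up for illustration
df = pd.DataFrame({
    "Car Brand": pd.Series(["Toyota", "Honda", "Toyota", "Lexus"] * 1000, dtype="object"),
    "Price (RM)": [98999, 67000, 55999, 375000] * 1000,
})

shallow = df.memory_usage().sum()        # object column counted as references only
deep = df.memory_usage(deep=True).sum()  # also counts the Python string objects

as_category = df.astype({"Car Brand": "category"})
deep_category = as_category.memory_usage(deep=True).sum()

print(f"shallow                 : {shallow / 1024**2:.3f} MB")
print(f"deep                    : {deep / 1024**2:.3f} MB")
print(f"deep, categorical brand : {deep_category / 1024**2:.3f} MB")
```

Measured with `deep=True`, the savings from categorical conversion would likely be larger than the shallow figures reported in this section suggest.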


In [84]:
# Setup
process_numeric_downcasting_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_numeric_downcasting_pandas = time.time()
start_cpu_time_numeric_downcasting_pandas = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_numeric_downcasting_pandas = psutil.cpu_percent(interval=None)  # Initial CPU usage

# === Numeric Downcasting and Categorical Conversion for pandas ===
df_pandas['Manufacture Year'] = df_pandas['Manufacture Year'].astype('int16')
df_pandas[['Price (RM)', 'Installment (RM)']] = df_pandas[['Price (RM)', 'Installment (RM)']].astype('int32')
df_pandas['Seat Capacity'] = pd.to_numeric(df_pandas['Seat Capacity'], errors='coerce').astype('int8')

# Categorical conversion
categorical_columns_numeric_downcasting_pandas = ['Car Brand', 'Car Model', 'Body Type', 'Fuel Type',
                                                  'Transmission', 'Color', 'Condition', 'Sales Channel', 'Mileage (K KM)']
df_pandas[categorical_columns_numeric_downcasting_pandas] = df_pandas[categorical_columns_numeric_downcasting_pandas].astype('category')
# =====================================================

# Record end time, memory, and CPU usage
end_time_numeric_downcasting_pandas = time.time()
end_cpu_time_numeric_downcasting_pandas = psutil.cpu_times().user  # End CPU time
current_mem_numeric_downcasting_pandas, peak_mem_numeric_downcasting_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled for 1 second after the operation has finished
cpu_usage_numeric_downcasting_pandas = psutil.cpu_percent(interval=1)  # blocking 1-second sample

# Calculate metrics
elapsed_time_numeric_downcasting_pandas = end_time_numeric_downcasting_pandas - start_time_numeric_downcasting_pandas  # Wall time in seconds
cpu_time_numeric_downcasting_pandas = (end_cpu_time_numeric_downcasting_pandas - start_cpu_time_numeric_downcasting_pandas) * 1000  # CPU time in ms
throughput_numeric_downcasting_pandas = df_pandas.shape[0] / elapsed_time_numeric_downcasting_pandas if elapsed_time_numeric_downcasting_pandas > 0 else 0

# Output preview
print("✅ Numeric downcasting and categorical conversion completed")
print(df_pandas.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {df_pandas.shape[0]}")
print(f"Wall Time (Elapsed)    : {elapsed_time_numeric_downcasting_pandas:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_numeric_downcasting_pandas:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_numeric_downcasting_pandas:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_numeric_downcasting_pandas:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_numeric_downcasting_pandas / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_numeric_downcasting_pandas / 1e6:.2f} MB")
print("=================================================")


✅ Numeric downcasting and categorical conversion completed
                                            Car Name Car Brand Car Model  \
0                   2023 Lexus RX350 2.4 F Sport SUV     Lexus     RX350   
1  2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...    Toyota    Estima   
2  2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...   Porsche   Cayenne   

   Manufacture Year Body Type                Fuel Type Mileage (K KM)  \
0              2023       SUV  Petrol - Unleaded (ULP)         5 - 10   
1              2010       MPV  Petrol - Unleaded (ULP)      115 - 120   
2              2020     Coupe  Petrol - Unleaded (ULP)        20 - 25   

  Transmission  Color  Price (RM)  Installment (RM)    Condition  \
0    Automatic  Black      375000              4862  Refurbished   
1    Automatic  White       55999               726         Used   
2    Automatic   Grey      662222              8585  Refurbished   

            Location Sales Channel  Seat Capacity  
0   Selangor, Klan

In [85]:
# Check the data types of all columns after the conversion
df_pandas.dtypes

Car Name              object
Car Brand           category
Car Model           category
Manufacture Year       int16
Body Type           category
Fuel Type           category
Mileage (K KM)      category
Transmission        category
Color               category
Price (RM)             int32
Installment (RM)       int32
Condition           category
Location              object
Sales Channel       category
Seat Capacity           int8
dtype: object

In [86]:
# Calculate memory usage after the conversion
end_mem_check_pandas = df_pandas.memory_usage().sum() / 1024**2  # Convert bytes to megabytes
print(f"Memory usage after optimization: {end_mem_check_pandas:.2f} MB")

# Print the memory saved
print(f"Memory saved: {start_mem_check_pandas - end_mem_check_pandas:.2f} MB")

Memory usage after optimization: 6.95 MB
Memory saved: 12.68 MB


**2. Polars**

In [87]:
# Information about the DataFrame
print(df_polars.schema)
print(f"\nShape: {df_polars.shape[0]} rows × {df_polars.shape[1]} columns")

Schema({'Car Name': String, 'Car Brand': String, 'Car Model': String, 'Manufacture Year': Int64, 'Body Type': String, 'Fuel Type': String, 'Mileage (K KM)': String, 'Transmission': String, 'Color': String, 'Price (RM)': Int64, 'Installment (RM)': Int64, 'Condition': String, 'Location': String, 'Sales Channel': String, 'Seat Capacity': String})

Shape: 160794 rows × 15 columns


In [88]:
# Check initial memory usage in MB
start_mem_check_polars = df_polars.estimated_size() / 1024**2  # in MB
print(f"Initial memory usage: {start_mem_check_polars:.2f} MB")

Initial memory usage: 29.28 MB


In [89]:
# Setup
process_numeric_downcasting_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_numeric_downcasting_polars = time.time()
start_cpu_time_numeric_downcasting_polars = psutil.cpu_times().user  # Start CPU time
start_cpu_percent_numeric_downcasting_polars = psutil.cpu_percent(interval=None)  # Initial CPU usage

# === Numeric Downcasting and Categorical Conversion for Polars ===
df_polars = df_polars.with_columns([
    pl.col("Manufacture Year").cast(pl.Int16),
    pl.col("Price (RM)").cast(pl.Int32),
    pl.col("Installment (RM)").cast(pl.Int32),
    pl.col("Seat Capacity").cast(pl.Int8),
])

# Categorical conversion
categorical_columns_numeric_downcasting_polars = ['Car Brand', 'Car Model', 'Body Type', 'Fuel Type',
                                                  'Transmission', 'Color', 'Condition', 'Sales Channel', 'Mileage (K KM)']
df_polars = df_polars.with_columns(
    pl.col(categorical_columns_numeric_downcasting_polars).cast(pl.Categorical)
)

# =====================================================

# Record end time, memory, and CPU usage
end_time_numeric_downcasting_polars = time.time()
end_cpu_time_numeric_downcasting_polars = psutil.cpu_times().user  # End CPU time
current_mem_numeric_downcasting_polars, peak_mem_numeric_downcasting_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU usage, sampled for 1 second after the operation has finished
cpu_usage_numeric_downcasting_polars = psutil.cpu_percent(interval=1)  # blocking 1-second sample

# Calculate metrics
elapsed_time_numeric_downcasting_polars = end_time_numeric_downcasting_polars - start_time_numeric_downcasting_polars  # Wall time in seconds
cpu_time_numeric_downcasting_polars = (end_cpu_time_numeric_downcasting_polars - start_cpu_time_numeric_downcasting_polars) * 1000  # CPU time in ms
throughput_numeric_downcasting_polars = df_polars.shape[0] / elapsed_time_numeric_downcasting_polars if elapsed_time_numeric_downcasting_polars > 0 else 0

# Output preview
print("✅ Numeric downcasting and categorical conversion completed")
print(df_polars.head(3))

# Performance summary
print("\n============== Polars Performance Summary ==============")
print(f"Total rows processed     : {df_polars.shape[0]}")
print(f"Wall Time (Elapsed)      : {elapsed_time_numeric_downcasting_polars:.4f} seconds")
print(f"CPU Time                 : {cpu_time_numeric_downcasting_polars:.0f} ms")
print(f"CPU Usage               : {cpu_usage_numeric_downcasting_polars:.2f}%")
print(f"Throughput              : {throughput_numeric_downcasting_polars:.2f} rows/sec")
print(f"Current Memory (Python) : {current_mem_numeric_downcasting_polars / 1e6:.2f} MB")
print(f"Peak Memory (Python)    : {peak_mem_numeric_downcasting_polars / 1e6:.2f} MB")
print("========================================================")

✅ Numeric downcasting and categorical conversion completed
shape: (3, 15)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ Car Name  ┆ Car Brand ┆ Car Model ┆ Manufactu ┆ … ┆ Condition ┆ Location  ┆ Sales     ┆ Seat     │
│ ---       ┆ ---       ┆ ---       ┆ re Year   ┆   ┆ ---       ┆ ---       ┆ Channel   ┆ Capacity │
│ str       ┆ cat       ┆ cat       ┆ ---       ┆   ┆ cat       ┆ str       ┆ ---       ┆ ---      │
│           ┆           ┆           ┆ i16       ┆   ┆           ┆           ┆ cat       ┆ i8       │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2022      ┆ Toyota    ┆ Corolla   ┆ 2022      ┆ … ┆ Used      ┆ Johor,    ┆ Dealer    ┆ 5        │
│ Toyota    ┆           ┆ Cross     ┆           ┆   ┆           ┆ Johor     ┆           ┆          │
│ Corolla   ┆           ┆           ┆           ┆   ┆           ┆ Bahru     ┆           ┆          │
│ Cross 1.8 ┆    

In [90]:
print("\nColumn Data Types After Optimization:")
print(df_polars.dtypes)


Column Data Types After Optimization:
[String, Categorical(ordering='physical'), Categorical(ordering='physical'), Int16, Categorical(ordering='physical'), Categorical(ordering='physical'), Categorical(ordering='physical'), Categorical(ordering='physical'), Categorical(ordering='physical'), Int32, Int32, Categorical(ordering='physical'), String, Categorical(ordering='physical'), Int8]


In [91]:
end_mem_check_polars = df_polars.estimated_size() / 1024**2
print(f"\nMemory usage after optimization: {end_mem_check_polars:.2f} MB")
print(f"Memory saved: {start_mem_check_polars - end_mem_check_polars:.2f} MB")


Memory usage after optimization: 21.28 MB
Memory saved: 8.00 MB
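
To put the absolute savings in proportion, the snippet below converts the before and after figures printed above into relative reductions. The two libraries measure memory differently (`memory_usage()` without `deep=True` in pandas, `estimated_size()` in Polars), so the percentages are best read per library and not as a direct comparison.

```python
# (before_mb, after_mb) as printed in the cells above
reported = {
    "pandas": (19.63, 6.95),
    "polars": (29.28, 21.28),
}

for library, (before, after) in reported.items():
    saved = before - after
    print(f"{library}: saved {saved:.2f} MB ({saved / before:.1%} of the original)")
```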


**3. Modin**

In [92]:
# Check the initial info about the dataset
df_modin.info()

<class 'modin.pandas.dataframe.DataFrame'>
Index: 160794 entries, 0 to 174149
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Car Name          160794 non-null  object
 1   Car Brand         160794 non-null  object
 2   Car Model         160794 non-null  object
 3   Manufacture Year  160794 non-null  int64 
 4   Body Type         160794 non-null  object
 5   Fuel Type         160794 non-null  object
 6   Mileage (K KM)    160794 non-null  object
 7   Transmission      160794 non-null  object
 8   Color             160794 non-null  object
 9   Price (RM)        160794 non-null  int64 
 10  Installment (RM)  160794 non-null  int64 
 11  Condition         160794 non-null  object
 12  Location          160794 non-null  object
 13  Sales Channel     160794 non-null  object
 14  Seat Capacity     160794 non-null  object
dtypes: int64(3), object(12)
memory usage: 19.6+ MB


In [93]:
# Calculate initial memory usage in MB
start_mem_check_modin = df_modin.memory_usage().sum() / 1024**2  # Convert bytes to megabytes
print(f"Initial memory usage: {start_mem_check_modin:.2f} MB")

Initial memory usage: 19.63 MB


In [94]:
# Setup
process_numeric_downcasting_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_numeric_downcasting_modin = time.time()
start_cpu_time_numeric_downcasting_modin = psutil.cpu_times().user  # System-wide user CPU time (seconds) at start
start_cpu_percent_numeric_downcasting_modin = psutil.cpu_percent(interval=None)  # Initial CPU reading; the value is not used later

# === Numeric Downcasting and Categorical Conversion for Modin ===
df_modin['Manufacture Year'] = df_modin['Manufacture Year'].astype('int16')
df_modin[['Price (RM)', 'Installment (RM)']] = df_modin[['Price (RM)', 'Installment (RM)']].astype('int32')
df_modin['Seat Capacity'] = pd.to_numeric(df_modin['Seat Capacity'], errors='coerce').astype('int8')

# Categorical conversion
categorical_columns_numeric_downcasting_modin = ['Car Brand', 'Car Model', 'Body Type', 'Fuel Type',
                                                 'Transmission', 'Color', 'Condition', 'Sales Channel', 'Mileage (K KM)']
df_modin[categorical_columns_numeric_downcasting_modin] = df_modin[categorical_columns_numeric_downcasting_modin].apply(lambda x: x.astype('category'))
# =====================================================

# Record end time, memory, and CPU usage
end_time_numeric_downcasting_modin = time.time()
end_cpu_time_numeric_downcasting_modin = psutil.cpu_times().user  # End CPU time
current_mem_numeric_downcasting_modin, peak_mem_numeric_downcasting_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the conversion finishes
cpu_usage_numeric_downcasting_modin = psutil.cpu_percent(interval=1)  # blocks for 1 second

# Calculate metrics
elapsed_time_numeric_downcasting_modin = end_time_numeric_downcasting_modin - start_time_numeric_downcasting_modin  # Wall time in seconds
cpu_time_numeric_downcasting_modin = (end_cpu_time_numeric_downcasting_modin - start_cpu_time_numeric_downcasting_modin) * 1000  # CPU time in ms
throughput_numeric_downcasting_modin = df_modin.shape[0] / elapsed_time_numeric_downcasting_modin if elapsed_time_numeric_downcasting_modin > 0 else 0

# Output preview
print("✅ Numeric downcasting and categorical conversion completed")
print(df_modin.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {df_modin.shape[0]}")
print(f"Wall Time (Elapsed)    : {elapsed_time_numeric_downcasting_modin:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_numeric_downcasting_modin:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_numeric_downcasting_modin:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_numeric_downcasting_modin:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_numeric_downcasting_modin / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_numeric_downcasting_modin / 1e6:.2f} MB")
print("=================================================")


✅ Numeric downcasting and categorical conversion completed
                                            Car Name Car Brand Car Model  \
0                   2023 Lexus RX350 2.4 F Sport SUV     Lexus     RX350   
1  2010 Toyota Estima 2.4 Aeras MPV Hot Mpv Car I...    Toyota    Estima   
2  2020 Porsche Cayenne Coupe 4.0 V8 Turbo AWD Un...   Porsche   Cayenne   

   Manufacture Year Body Type                Fuel Type Mileage (K KM)  \
0              2023       SUV  Petrol - Unleaded (ULP)         5 - 10   
1              2010       MPV  Petrol - Unleaded (ULP)      115 - 120   
2              2020     Coupe  Petrol - Unleaded (ULP)        20 - 25   

  Transmission  Color  Price (RM)  Installment (RM)    Condition  \
0    Automatic  Black      375000              4862  Refurbished   
1    Automatic  White       55999               726         Used   
2    Automatic   Grey      662222              8585  Refurbished   

            Location Sales Channel  Seat Capacity  
0   Selangor, Klan

In [95]:
# Check the dataset info after optimization
df_modin.info()

<class 'modin.pandas.dataframe.DataFrame'>
Index: 160794 entries, 0 to 174149
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   Car Name          160794 non-null  object  
 1   Car Brand         160794 non-null  category
 2   Car Model         160794 non-null  category
 3   Manufacture Year  160794 non-null  int16   
 4   Body Type         160794 non-null  category
 5   Fuel Type         160794 non-null  category
 6   Mileage (K KM)    160794 non-null  category
 7   Transmission      160794 non-null  category
 8   Color             160794 non-null  category
 9   Price (RM)        160794 non-null  int32   
 10  Installment (RM)  160794 non-null  int32   
 11  Condition         160794 non-null  category
 12  Location          160794 non-null  object  
 13  Sales Channel     160794 non-null  category
 14  Seat Capacity     160794 non-null  int8    
dtypes: category(9), int16(1), int32(2), int8(1), object

In [96]:
# Calculate memory usage after the conversion
end_mem_check_modin = df_modin.memory_usage().sum() / 1024**2  # Convert bytes to megabytes
print(f"Memory usage after optimization: {end_mem_check_modin:.2f} MB")

# Print the memory saved
print(f"Memory saved: {start_mem_check_modin - end_mem_check_modin:.2f} MB")

Memory usage after optimization: 7.44 MB
Memory saved: 12.19 MB
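
The integer widths used in the downcasting cell (`int16`, `int32`, `int8`) only work if every value in the column fits the target range. The following is a minimal sketch of such a range check, not the notebook's own method. The helper `smallest_int_dtype` and the `INT_RANGES` table are hypothetical names introduced for illustration, and the sample values are the three preview rows printed above, not the full columns.

```python
# Hypothetical helper (not part of the notebook): choose the narrowest
# signed integer width whose range covers every value in a column.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer dtype that fits `values`."""
    values = list(values)
    if not values:
        raise ValueError("cannot infer a dtype from an empty column")
    lo, hi = min(values), max(values)
    for name, dtype_min, dtype_max in INT_RANGES:
        if dtype_min <= lo and hi <= dtype_max:
            return name
    raise OverflowError("values do not fit in a 64-bit signed integer")


# Sample values copied from the three preview rows printed above.
print(smallest_int_dtype([2023, 2010, 2020]))       # Manufacture Year
print(smallest_int_dtype([375000, 55999, 662222]))  # Price (RM)
print(smallest_int_dtype([4862, 726, 8585]))        # Installment (RM)
```

Three rows are not enough to justify a width. The check would need to run over each full column's minimum and maximum before narrowing it.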


**4. Dask**

In [97]:
# Check the initial info about the dataset
df_dask.info()


<class 'dask.dataframe.dask_expr.DataFrame'>
Columns: 15 entries, Car Name to Seat Capacity
dtypes: object(2), int64(3), string(10)

In [98]:
# Calculate initial memory usage in MB
start_mem_check_dask = df_dask.memory_usage().sum().compute() / 1024**2  # Convert bytes to megabytes
print(f"Initial memory usage: {start_mem_check_dask:.2f} MB")


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Initial memory usage: 43.44 MB


In [99]:
# Setup
process_numeric_downcasting_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_numeric_downcasting_dask = time.time()
start_cpu_time_numeric_downcasting_dask = psutil.cpu_times().user  # System-wide user CPU time (seconds) at start
start_cpu_percent_numeric_downcasting_dask = psutil.cpu_percent(interval=None)  # Initial CPU reading; the value is not used later

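# Numeric downcasting for Dask
# Note: Dask records these conversions lazily, so the wall time measured below
# mostly reflects building the task graph rather than converting the data.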
df_dask['Manufacture Year'] = df_dask['Manufacture Year'].astype('int16')
df_dask[['Price (RM)', 'Installment (RM)']] = df_dask[['Price (RM)', 'Installment (RM)']].astype('int32')
df_dask['Seat Capacity'] = dd.to_numeric(df_dask['Seat Capacity'], errors='coerce').astype('int8')

# Categorical conversion
categorical_columns_numeric_downcasting_dask = ['Car Brand', 'Car Model', 'Body Type', 'Fuel Type',
                                                'Transmission', 'Color', 'Condition', 'Sales Channel', 'Mileage (K KM)']
# Convert each column to category individually
for col in categorical_columns_numeric_downcasting_dask:
    df_dask[col] = df_dask[col].astype('category')
    
# Record end time, memory, and CPU usage
end_time_numeric_downcasting_dask = time.time()
end_cpu_time_numeric_downcasting_dask = psutil.cpu_times().user  # End CPU time
current_mem_numeric_downcasting_dask, peak_mem_numeric_downcasting_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the conversion finishes
cpu_usage_numeric_downcasting_dask = psutil.cpu_percent(interval=1)  # blocks for 1 second

# Calculate metrics
elapsed_time_numeric_downcasting_dask = end_time_numeric_downcasting_dask - start_time_numeric_downcasting_dask  # Wall time in seconds
cpu_time_numeric_downcasting_dask = (end_cpu_time_numeric_downcasting_dask - start_cpu_time_numeric_downcasting_dask) * 1000  # CPU time in ms
throughput_numeric_downcasting_dask = df_dask.shape[0].compute() / elapsed_time_numeric_downcasting_dask if elapsed_time_numeric_downcasting_dask > 0 else 0

# Output preview
print("✅ Numeric downcasting and categorical conversion completed")
print(df_dask.head(3))

# Performance summary
print("\n============== Performance Summary ==============")
print(f"Total rows processed   : {df_dask.shape[0].compute()}")
print(f"Wall Time (Elapsed)    : {elapsed_time_numeric_downcasting_dask:.4f} seconds")  # Wall time
print(f"CPU Time               : {cpu_time_numeric_downcasting_dask:.0f} ms")  # CPU time
print(f"CPU Usage             : {cpu_usage_numeric_downcasting_dask:.2f}%")  # CPU Usage
print(f"Throughput            : {throughput_numeric_downcasting_dask:.2f} rows/sec")  # Throughput
print(f"Current Memory (Python): {current_mem_numeric_downcasting_dask / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_numeric_downcasting_dask / 1e6:.2f} MB")
print("=================================================")

✅ Numeric downcasting and categorical conversion completed


                                             Car Name Car Brand  \
3              2021 Honda City 1.5 V i-VTEC Hatchback     Honda   
4   2022 Toyota Corolla Cross 1.8 V SUV Full Servi...    Toyota   
26  2014 Inokom Elantra 1.6 Sedan HIGH SPEC (A) NE...    Inokom   

        Car Model  Manufacture Year  Body Type                Fuel Type  \
3            City              2021  Hatchback  Petrol - Unleaded (ULP)   
4   Corolla Cross              2022        SUV  Petrol - Unleaded (ULP)   
26        Elantra              2014      Sedan  Petrol - Unleaded (ULP)   

   Mileage (K KM) Transmission   Color  Price (RM)  Installment (RM)  \
3         90 - 95    Automatic  Silver       67000               869   
4         80 - 85    Automatic   White       98999              1283   
26      100 - 105    Automatic   White       22999               298   

   Condition                    Location Sales Channel  Seat Capacity  
3       Used           Johor, Ulu Tiram    Sales Agent              5

Total rows processed   : 160794
Wall Time (Elapsed)    : 0.0840 seconds
CPU Time               : 78 ms
CPU Usage             : 19.40%
Throughput            : 1913473.14 rows/sec
Current Memory (Python): 0.34 MB
Peak Memory (Python)   : 0.36 MB


In [100]:
df_dask.dtypes

Car Name            string[pyarrow]
Car Brand                  category
Car Model                  category
Manufacture Year              int16
Body Type                  category
Fuel Type                  category
Mileage (K KM)             category
Transmission               category
Color                      category
Price (RM)                    int32
Installment (RM)              int32
Condition                  category
Location            string[pyarrow]
Sales Channel              category
Seat Capacity                  int8
dtype: object

In [101]:
# Calculate memory usage after the conversion
end_mem_check_dask = df_dask.memory_usage().sum().compute() / 1024**2  # Convert bytes to megabytes
print(f"Memory usage after optimization: {end_mem_check_dask:.2f} MB")

# Print the memory saved
print(f"Memory saved: {start_mem_check_dask - end_mem_check_dask:.2f} MB")

Memory usage after optimization: 21.21 MB
Memory saved: 22.23 MB
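
The absolute savings printed above are hard to compare directly because each library starts from a different reported size. A small sketch of the relative reduction follows, using only the "after" and "saved" figures printed in the Polars, Modin and Dask cells. The names `reported` and `percent_saved` are introduced here for illustration. Treat the comparison with caution: Polars reports `estimated_size()`, while the Modin and Dask cells call `memory_usage()` without `deep=True`, so object-dtype string columns are likely undercounted there (the `+` in `19.6+ MB` from `info()` hints at this).

```python
# Figures copied from the cell outputs above, in MB: (after optimization, saved).
reported = {
    "Polars": (21.28, 8.00),
    "Modin": (7.44, 12.19),
    "Dask": (21.21, 22.23),
}


def percent_saved(after_mb, saved_mb):
    """Relative reduction, where the starting size is after + saved."""
    before_mb = after_mb + saved_mb
    return 100 * saved_mb / before_mb


reductions = {name: percent_saved(*sizes) for name, sizes in reported.items()}
for name, pct in reductions.items():
    print(f"{name:<7}: {pct:.1f}% smaller")
```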


## 3. Export Cleaned Data to MongoDB

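Each library's cleaned dataset is inserted into its own collection (`listings_<library>_cleaned`) in the `carlist_db` MongoDB database. Because `insert_many` only appends documents, records already in a collection are not overwritten; re-running a cell therefore adds a second copy of the rows.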

**1. Pandas**

In [105]:
# Setup
process_export_pandas = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_export_pandas = time.perf_counter()
start_cpu_time_export_pandas = psutil.cpu_times().user  # System-wide user CPU time (seconds) at start
start_cpu_percent_export_pandas = psutil.cpu_percent(interval=None)  # Initial CPU reading; the value is not used later

# ===============================================
# === Insert cleaned DataFrame into MongoDB ===
client_export_pandas = MongoClient("mongodb://localhost:27017")
db_export_pandas = client_export_pandas["carlist_db"]

df_pandas_carlist = df_pandas.copy()  # Work on a copy of the cleaned DataFrame

# Create a new collection for the cleaned data
cleaned_collection_export_pandas = db_export_pandas["listings_pandas_cleaned"]

# Insert cleaned data
cleaned_collection_export_pandas.insert_many(df_pandas_carlist.to_dict(orient="records"))
# ===============================================

# Record end time, memory, and CPU usage
end_time_export_pandas = time.perf_counter()
end_cpu_time_export_pandas = psutil.cpu_times().user  # End CPU time
current_mem_export_pandas, peak_mem_export_pandas = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the export finishes
cpu_usage_export_pandas = psutil.cpu_percent(interval=1)  # blocks for 1 second

# Calculate metrics
elapsed_time_export_pandas = end_time_export_pandas - start_time_export_pandas
cpu_time_export_pandas = (end_cpu_time_export_pandas - start_cpu_time_export_pandas) * 1000  # CPU time in ms
total_rows_export_pandas = len(df_pandas_carlist)
throughput_export_pandas = total_rows_export_pandas / elapsed_time_export_pandas if elapsed_time_export_pandas > 0 else 0

# Output confirmation
print("✅ Cleaned data (df_pandas_carlist) inserted into MongoDB collection 'listings_pandas_cleaned' successfully.")

# Performance summary
print("\n============== Export Performance Summary ==============")
print(f"Total rows exported    : {total_rows_export_pandas}")
print(f"Wall Time (Elapsed)    : {elapsed_time_export_pandas:.4f} seconds")  # Time
print(f"CPU Time               : {cpu_time_export_pandas:.0f} ms")            # CPU time
print(f"CPU Usage              : {cpu_usage_export_pandas:.2f}%")            # CPU Usage
print(f"Throughput             : {throughput_export_pandas:.2f} rows/sec")   # Throughput
print(f"Current Memory (Python): {current_mem_export_pandas / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_export_pandas / 1e6:.2f} MB")
print("========================================================")

✅ Cleaned data (df_pandas_carlist) inserted into MongoDB collection 'listings_pandas_cleaned' successfully.

Total rows exported    : 160794
Wall Time (Elapsed)    : 18.4362 seconds
CPU Time               : 31219 ms
CPU Usage              : 19.90%
Throughput             : 8721.67 rows/sec
Current Memory (Python): 14.62 MB
Peak Memory (Python)   : 184.51 MB
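
The timing, CPU and `tracemalloc` bookkeeping is repeated in every cell of this section. One way to factor it out is a context manager, sketched below with the standard library only. The name `measure` and the keys of the `stats` dictionary are assumptions for illustration, not part of the notebook. The sketch also swaps `psutil.cpu_times().user`, which is system-wide, for `time.process_time()`, which counts CPU time of the current process only, so its numbers would not match the ones reported here.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, rows):
    """Collect wall time, process CPU time and traced Python memory for a block."""
    stats = {"label": label, "rows": rows}
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process only (user + system)
    try:
        yield stats
    finally:
        stats["wall_s"] = time.perf_counter() - wall_start
        stats["cpu_ms"] = (time.process_time() - cpu_start) * 1000
        current_bytes, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        stats["current_mb"] = current_bytes / 1e6
        stats["peak_mb"] = peak_bytes / 1e6
        stats["rows_per_s"] = rows / stats["wall_s"] if stats["wall_s"] > 0 else 0


# Stand-in workload; in the notebook this would be the insert_many call.
with measure("demo", rows=1000) as result:
    records = [{"id": i} for i in range(1000)]

print(f"Wall Time (Elapsed)    : {result['wall_s']:.4f} seconds")
print(f"CPU Time               : {result['cpu_ms']:.0f} ms")
print(f"Throughput             : {result['rows_per_s']:.2f} rows/sec")
print(f"Peak Memory (Python)   : {result['peak_mb']:.2f} MB")
```

Because the statistics are filled in inside `finally`, they are available even if the measured block raises.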


**2. Polars**

In [106]:
# Setup
process_export_polars = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_export_polars = time.perf_counter()
start_cpu_time_export_polars = psutil.cpu_times().user  # System-wide user CPU time (seconds) at start
start_cpu_percent_export_polars = psutil.cpu_percent(interval=None)  # Initial CPU reading; the value is not used later

# ===============================================
# === Insert cleaned DataFrame into MongoDB ===
client_export_polars = MongoClient("mongodb://localhost:27017")
db_export_polars = client_export_polars["carlist_db"]

df_polars_carlist = df_polars.clone()  # Work on a clone of the cleaned DataFrame

# Create a new collection for the cleaned data
cleaned_collection_export_polars = db_export_polars["listings_polars_cleaned"]

# Insert cleaned data
cleaned_collection_export_polars.insert_many(df_polars_carlist.to_dicts())
# ===============================================

# Record end time, memory, and CPU usage
end_time_export_polars = time.perf_counter()
end_cpu_time_export_polars = psutil.cpu_times().user  # End CPU time
current_mem_export_polars, peak_mem_export_polars = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the export finishes
cpu_usage_export_polars = psutil.cpu_percent(interval=1)  # blocks for 1 second

# Calculate metrics
elapsed_time_export_polars = end_time_export_polars - start_time_export_polars
cpu_time_export_polars = (end_cpu_time_export_polars - start_cpu_time_export_polars) * 1000  # CPU time in ms
total_rows_export_polars = len(df_polars_carlist)
throughput_export_polars = total_rows_export_polars / elapsed_time_export_polars if elapsed_time_export_polars > 0 else 0

# Output confirmation
print("✅ Cleaned data (df_polars_carlist) inserted into MongoDB collection 'listings_polars_cleaned' successfully.")

# Performance summary
print("\n============== Export Performance Summary ==============")
print(f"Total rows exported    : {total_rows_export_polars}")
print(f"Wall Time (Elapsed)    : {elapsed_time_export_polars:.4f} seconds")  # Time
print(f"CPU Time               : {cpu_time_export_polars:.0f} ms")            # CPU time
print(f"CPU Usage              : {cpu_usage_export_polars:.2f}%")            # CPU Usage
print(f"Throughput             : {throughput_export_polars:.2f} rows/sec")   # Throughput
print(f"Current Memory (Python): {current_mem_export_polars / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_export_polars / 1e6:.2f} MB")
print("========================================================")


✅ Cleaned data (df_polars_carlist) inserted into MongoDB collection 'listings_polars_cleaned' successfully.

Total rows exported    : 160794
Wall Time (Elapsed)    : 11.4875 seconds
CPU Time               : 14891 ms
CPU Usage              : 14.40%
Throughput             : 13997.33 rows/sec
Current Memory (Python): 0.50 MB
Peak Memory (Python)   : 269.53 MB


**3. Modin**

In [107]:
# Setup
process_export_modin = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_export_modin = time.perf_counter()
start_cpu_time_export_modin = psutil.cpu_times().user  # System-wide user CPU time (seconds) at start
start_cpu_percent_export_modin = psutil.cpu_percent(interval=None)  # Initial CPU reading; the value is not used later

# ===============================================
# === Insert cleaned DataFrame into MongoDB ===
client_export_modin = MongoClient("mongodb://localhost:27017")
db_export_modin = client_export_modin["carlist_db"]

df_modin_carlist = df_modin.copy()  # Work on a copy of the cleaned DataFrame

# Create a new collection for the cleaned data
cleaned_collection_export_modin = db_export_modin["listings_modin_cleaned"]

# Insert cleaned data
cleaned_collection_export_modin.insert_many(df_modin_carlist.to_dict(orient="records"))
# ===============================================

# Record end time, memory, and CPU usage
end_time_export_modin = time.perf_counter()
end_cpu_time_export_modin = psutil.cpu_times().user  # End CPU time
current_mem_export_modin, peak_mem_export_modin = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the export finishes
cpu_usage_export_modin = psutil.cpu_percent(interval=1)  # blocks for 1 second

# Calculate metrics
elapsed_time_export_modin = end_time_export_modin - start_time_export_modin
cpu_time_export_modin = (end_cpu_time_export_modin - start_cpu_time_export_modin) * 1000  # CPU time in ms
total_rows_export_modin = len(df_modin_carlist)
throughput_export_modin = total_rows_export_modin / elapsed_time_export_modin if elapsed_time_export_modin > 0 else 0

# Output confirmation
print("✅ Cleaned data (df_modin_carlist) inserted into MongoDB collection 'listings_modin_cleaned' successfully.")

# Performance summary
print("\n============== Export Performance Summary ==============")
print(f"Total rows exported    : {total_rows_export_modin}")
print(f"Wall Time (Elapsed)    : {elapsed_time_export_modin:.4f} seconds")  # Time
print(f"CPU Time               : {cpu_time_export_modin:.0f} ms")            # CPU time
print(f"CPU Usage              : {cpu_usage_export_modin:.2f}%")            # CPU Usage
print(f"Throughput             : {throughput_export_modin:.2f} rows/sec")   # Throughput
print(f"Current Memory (Python): {current_mem_export_modin / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_export_modin / 1e6:.2f} MB")
print("========================================================")




✅ Cleaned data (df_modin_carlist) inserted into MongoDB collection 'listings_modin_cleaned' successfully.

Total rows exported    : 160794
Wall Time (Elapsed)    : 19.8898 seconds
CPU Time               : 26250 ms
CPU Usage              : 13.20%
Throughput             : 8084.24 rows/sec
Current Memory (Python): 4.73 MB
Peak Memory (Python)   : 202.58 MB


**4. Dask**

In [108]:
process_export_dask = psutil.Process(os.getpid())
tracemalloc.start()

# Record start time and CPU usage
start_time_export_dask = time.perf_counter()
start_cpu_time_export_dask = psutil.cpu_times().user  # System-wide user CPU time (seconds) at start
start_cpu_percent_export_dask = psutil.cpu_percent(interval=None)  # Initial CPU reading; the value is not used later

# ===============================================
# === Insert cleaned Dask DataFrame into MongoDB ===

client_export_dask = MongoClient("mongodb://localhost:27017")
db_export_dask = client_export_dask["carlist_db"]

# Compute Dask DataFrame to Pandas DataFrame for MongoDB insertion
df_dask_carlist = df_dask.compute()  # Convert Dask DataFrame to Pandas DataFrame

# Create a new collection for the cleaned data
cleaned_collection_export_dask = db_export_dask["listings_dask_cleaned"]

# Insert cleaned data into MongoDB
cleaned_collection_export_dask.insert_many(df_dask_carlist.to_dict(orient="records"))
# ===============================================

# Record end time, memory, and CPU usage
end_time_export_dask = time.perf_counter()
end_cpu_time_export_dask = psutil.cpu_times().user  # End CPU time
current_mem_export_dask, peak_mem_export_dask = tracemalloc.get_traced_memory()
tracemalloc.stop()

# System-wide CPU percent, sampled over the 1 second after the export finishes
cpu_usage_export_dask = psutil.cpu_percent(interval=1)  # blocks for 1 second

# Calculate metrics
elapsed_time_export_dask = end_time_export_dask - start_time_export_dask
cpu_time_export_dask = (end_cpu_time_export_dask - start_cpu_time_export_dask) * 1000  # CPU time in ms
total_rows_export_dask = len(df_dask_carlist)
throughput_export_dask = total_rows_export_dask / elapsed_time_export_dask if elapsed_time_export_dask > 0 else 0

# Output confirmation
print("✅ Cleaned data (df_dask_carlist) inserted into MongoDB collection 'listings_dask_cleaned' successfully.")

# Performance summary
print("\n============== Export Performance Summary ==============")
print(f"Total rows exported    : {total_rows_export_dask}")
print(f"Wall Time (Elapsed)    : {elapsed_time_export_dask:.4f} seconds")  # Time
print(f"CPU Time               : {cpu_time_export_dask:.0f} ms")            # CPU time
print(f"CPU Usage              : {cpu_usage_export_dask:.2f}%")            # CPU Usage
print(f"Throughput             : {throughput_export_dask:.2f} rows/sec")   # Throughput
print(f"Current Memory (Python): {current_mem_export_dask / 1e6:.2f} MB")  # Memory
print(f"Peak Memory (Python)   : {peak_mem_export_dask / 1e6:.2f} MB")
print("========================================================")

✅ Cleaned data (df_dask_carlist) inserted into MongoDB collection 'listings_dask_cleaned' successfully.

Total rows exported    : 160794
Wall Time (Elapsed)    : 41.0053 seconds
CPU Time               : 60609 ms
CPU Usage              : 9.80%
Throughput             : 3921.30 rows/sec
Current Memory (Python): 24.40 MB
Peak Memory (Python)   : 344.57 MB
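
All four exports report a peak traced memory far above the size of the frame itself. That is consistent with `to_dict(orient="records")` and `to_dicts()` building every row as a Python dictionary before `insert_many` is called. A possible alternative is to send the rows in batches, sketched below. The helpers `iter_batches` and `insert_in_batches` and the class `FakeCollection` are hypothetical names. `FakeCollection` stands in for a pymongo collection so the sketch runs without a MongoDB server. Whether batching actually lowers the peak for this dataset was not measured.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of at most `batch_size` items from any iterable of records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection, records, batch_size=10_000):
    """Call collection.insert_many once per batch; return the number of rows sent."""
    total = 0
    for batch in iter_batches(records, batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total


class FakeCollection:
    """Stand-in for a pymongo collection so the sketch runs without MongoDB."""

    def __init__(self):
        self.calls = []

    def insert_many(self, documents):
        self.calls.append(len(documents))


fake = FakeCollection()
rows = ({"Seat Capacity": 5, "row": i} for i in range(25))  # lazy row source
sent = insert_in_batches(fake, rows, batch_size=10)
print(sent, fake.calls)
```

The saving only materialises if the rows are produced lazily, for example from a generator over the frame, instead of from a fully built list of dictionaries.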
