# Preprocessing

In [15]:
# Import libraries
import pandas as pd
from datetime import datetime
import numpy as np

In [3]:
raw_df = pd.read_csv('data/crawl_data.csv')

Size of the data:


In [4]:
raw_df.shape

(2293, 46)

<h1> Brief introduction to the data: what does each column mean? </h1>

In [5]:
#TODO
raw_df.columns

Index(['Unnamed: 0', 'CARNAME', 'ID', 'Make', 'Model', 'Body color',
       'Type of finish', 'Interior color', 'Interior material', 'Body',
       'Doors', 'Seats', 'VIN', 'Fuel', 'Transmission', 'Drive type', 'Power',
       'El. motor power', 'CO2 emissions', 'Emission class', 'Battery type',
       'AC charging time', 'DC charging time', 'Battery warranty (km)',
       'Range extender', 'Mileage', 'First registration', 'Condition',
       'Consumption', 'Price', 'Currency', 'Tags', 'Engine capacity',
       'Valid MOT until', 'Previous owners', 'Engine power',
       'Battery capacity', 'Hybrid type', 'Electric range', 'Warranty until',
       'Weight', 'Country of origin', 'Secondary drive', 'Energy efficiency',
       'Full service history', 'Battery capacity1'],
      dtype='object')


<h2> Meaning of each column </h2>

- **Unnamed: 0**: the index of original data
- **CARNAME**: name of the car
- **ID**: id of the car
- **Make**: brand (manufacturer) of the car
- **Model**: model of the car
- **Body color**: body/exterior color of the car
- **Type of finish**: the paint finish or coating applied to the vehicle. 
    - Ex: Metallic finish is a type of paint that contains small metallic particles, typically aluminum flakes, which give the paint a shiny and reflective appearance.
- **Interior color**: color of the car's interior
- **Interior material**: materials used for the interior components of the vehicle, typically leather, cloth, or Alcantara (high-end)
- **Body**: body style or the overall design and structure of the vehicle (Sedan, SUV, MPV,...)
- **Doors**: number of doors on a vehicle. 4/5 means that a car has 4 main doors (two front doors and two rear doors) and a rear hatch/liftgate
- **Seats**: number of seating positions available for occupants.
- **VIN**: (Vehicle Identification Number) - A unique code assigned to every motor vehicle when it's manufactured, used for identification purposes.
- **Fuel**: the type of fuel the vehicle uses, such as gasoline, diesel, electric, hybrid, etc
- **Transmission**: The type of transmission system the vehicle has, such as automatic, manual, or semi-automatic.
- **Drive type**: Specifies whether the vehicle is front-wheel drive (FWD), rear-wheel drive (RWD), or all-wheel drive (AWD).
- **Power**:  total power output of the vehicle's engine or powertrain, often measured in horsepower (hp) or kilowatts (kW).
- **El. motor power**: The power output of the electric motor in electric or hybrid vehicles.
- **CO2 emissions**: The amount of carbon dioxide emitted by the vehicle, measured in grams per kilometer (g/km).
- **Emission class**: The vehicle's emission standard, indicating its compliance with environmental regulations.
- **Battery type**: The type of battery used in electric or hybrid vehicles, such as lithium-ion.
- **AC charging time**: The time (in hours) it takes to charge the vehicle's battery using alternating current (AC).
- **DC charging time**: The time (in minutes) it takes to charge the vehicle's battery using direct current (DC).
- **Battery warranty (km)**: The distance (in kilometers) covered by the warranty for the vehicle's battery.
- **Range extender**:  A feature in some electric vehicles that includes a backup power source (usually a small internal combustion engine) to extend the vehicle's range.
- **Mileage**: The total distance the vehicle has traveled, often measured in miles or kilometers.
- **First registration**: The date when the vehicle was first registered.
- **Condition**: The overall state or condition of the vehicle, such as new or used.
- **Consumption**: The fuel or energy consumption of the vehicle, often expressed in liters per 100 kilometers or miles per gallon.
- **Price**: The selling price of the vehicle.
- **Currency**: The currency in which the vehicle's price is quoted.
- **Tags**: Special features or functions the car is equipped with.
- **Engine capacity**: The total volume of all the cylinders in the engine (ccm).
- **Valid MOT until**: The date until which the vehicle's Ministry of Transport (MOT) certification is valid.
- **Previous owners**: The number of individuals or entities that have owned the vehicle before the current owner.
- **Engine power**: The power output of the vehicle's electric engine.
- **Battery capacity**: The total energy storage capacity of the vehicle's battery (kWh).
- **Hybrid type**: The specific type of hybrid technology employed by the vehicle, such as parallel hybrid or series hybrid.
- **Electric range**: The distance the vehicle can travel on electric power alone.
- **Warranty until**: The date until which the vehicle is covered by a warranty.
- **Weight**: The total weight of the vehicle, often measured in kilograms or pounds.
- **Country of origin**: The country where the vehicle was manufactured.
- **Secondary drive**: Additional features related to the vehicle's drive system, such as a secondary electric motor in hybrid vehicles.
- **Energy efficiency**:  The efficiency of the vehicle in converting energy into motion.
- **Full service history**: Documentation of all the services and maintenance performed on the vehicle.
- **Battery capacity1**: The total energy storage capacity of the vehicle's battery (Ah) (Ampere-hour).

<h1> Deduplication </h1>

In [37]:
#TODO
duplicate_rows = raw_df[raw_df.duplicated()]
# print duplicate rows
print(duplicate_rows)

Empty DataFrame
Columns: [Unnamed: 0, CARNAME, ID, Make, Model, Body color, Type of finish, Interior color, Interior material, Body, Doors, Seats, VIN, Fuel, Transmission, Drive type, Power, El. motor power, CO2 emissions, Emission class, Battery type, AC charging time, DC charging time, Battery warranty (km), Range extender, Mileage, First registration, Condition, Consumption, Price, Currency, Tags, Engine capacity, Valid MOT until, Previous owners, Engine power, Battery capacity, Hybrid type, Electric range, Warranty until, Weight, Country of origin, Secondary drive, Energy efficiency, Full service history, Battery capacity1]
Index: []

[0 rows x 46 columns]


- As we can see, there are no duplicated rows, because each row was crawled from a different link.
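
One caveat: `raw_df.duplicated()` compares entire rows, including `Unnamed: 0`, which usually differs for every row, so a listing crawled twice could go unnoticed. The sketch below uses a small made-up frame (the values are hypothetical, not taken from the dataset) and assumes that `ID` identifies a listing:

```python
import pandas as pd

# Hypothetical toy data: the same listing (ID 2) was crawled twice.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "ID": [1, 2, 2],
    "Make": ["Skoda", "BMW", "BMW"],
})

# Whole-row comparison finds nothing, because the index column differs.
full_row_dupes = toy.duplicated().sum()

# Comparing only the identifying column reveals the repeated listing.
id_dupes = toy.duplicated(subset=["ID"]).sum()

print(full_row_dupes, id_dupes)
```

On the real data, the same check would be `raw_df.duplicated(subset=['ID'])`, or `subset=['VIN']` if the VIN is considered the more reliable key.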

# Analyzing Data in Columns
Next, we will analyze and process the data in the columns of the crawled dataset.

- First, we get the data type (dtype) of each column in the DataFrame `raw_df` and save the result in the Series `dtypes` (the Series is indexed by column name).

In [6]:
#TODO
dtypes = raw_df.dtypes
dtypes

Unnamed: 0                 int64
CARNAME                   object
ID                         int64
Make                      object
Model                     object
Body color                object
Type of finish            object
Interior color            object
Interior material         object
Body                      object
Doors                     object
Seats                     object
VIN                       object
Fuel                      object
Transmission              object
Drive type                object
Power                     object
El. motor power           object
CO2 emissions             object
Emission class            object
Battery type              object
AC charging time          object
DC charging time          object
Battery warranty (km)     object
Range extender            object
Mileage                   object
First registration        object
Condition                 object
Consumption               object
Price                      int64
Currency  

- We need to check the data in the columns of the dataset.

In [7]:
df = raw_df.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2293 entries, 0 to 2292
Data columns (total 46 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             2293 non-null   int64  
 1   CARNAME                2293 non-null   object 
 2   ID                     2293 non-null   int64  
 3   Make                   2293 non-null   object 
 4   Model                  2293 non-null   object 
 5   Body color             2111 non-null   object 
 6   Type of finish         1204 non-null   object 
 7   Interior color         1894 non-null   object 
 8   Interior material      1930 non-null   object 
 9   Body                   2293 non-null   object 
 10  Doors                  2272 non-null   object 
 11  Seats                  2080 non-null   object 
 12  VIN                    2293 non-null   object 
 13  Fuel                   2293 non-null   object 
 14  Transmission           2293 non-null   object 
 15  Driv

- There are many columns with more than *70%* missing data, such as `Engine power`, `Full service history`, `Hybrid type`, ... Therefore, the columns with a high percentage of missing data will be removed before processing the data.

In [8]:
missing_column = ['Engine power','Full service history', 
                  'Hybrid type', 'El. motor power',
                  'Electric range', 'Valid MOT until',
                  'Energy efficiency', 'Secondary drive',
                  'Range extender','Battery capacity1']
df = df.drop(missing_column, axis = 1)
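
The list above is written out by hand. As a sketch of how such a list could be derived instead, the fraction of missing values per column can be compared against a cut-off. The toy frame and the `threshold` value below are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "Make": ["Skoda", "BMW", "Audi", "Kia"],
    "Hybrid type": [np.nan, np.nan, np.nan, "Parallel"],
    "Seats": [5, np.nan, 5, 7],
})

threshold = 0.7  # assumed cut-off: drop columns with more than 70% missing
missing_fraction = toy.isnull().mean()
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = toy.drop(columns=sparse_columns)
print(sparse_columns)
```

A purely threshold-based rule would also pick up columns this notebook deliberately keeps, such as the battery-related ones, so the computed list is best reviewed by hand before dropping anything.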

- Removing unnecessary columns:
    - `Unnamed: 0`: the index number
    - `Currency`: the currency unit, which is EUR for all entries

Therefore, we will also remove these columns from the dataset.

In [9]:
df = df.drop(['Currency','Unnamed: 0'], axis = 1)
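
The claim that a column such as `Currency` holds a single value can be checked before dropping it. A minimal sketch on made-up data (the frame below is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: every price is quoted in the same currency.
toy = pd.DataFrame({
    "Price": [20474, 28399, 41749],
    "Currency": ["EUR", "EUR", "EUR"],
})

# A column with a single distinct value carries no information.
constant_columns = [
    col for col in toy.columns if toy[col].nunique(dropna=False) == 1
]
print(constant_columns)
```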

- After removing the columns with missing data and unnecessary columns, we need to process the data in the remaining columns. In this section, we will proceed with data processing in these columns.

1. What is the current data type of each column? Are there columns with inappropriate data types? If so, convert them.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2293 entries, 0 to 2292
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   CARNAME                2293 non-null   object 
 1   ID                     2293 non-null   int64  
 2   Make                   2293 non-null   object 
 3   Model                  2293 non-null   object 
 4   Body color             2111 non-null   object 
 5   Type of finish         1204 non-null   object 
 6   Interior color         1894 non-null   object 
 7   Interior material      1930 non-null   object 
 8   Body                   2293 non-null   object 
 9   Doors                  2272 non-null   object 
 10  Seats                  2080 non-null   object 
 11  VIN                    2293 non-null   object 
 12  Fuel                   2293 non-null   object 
 13  Transmission           2293 non-null   object 
 14  Drive type             2293 non-null   object 
 15  Powe

- Some columns have inappropriate data types.
- The columns `Engine capacity`, `Power`, `Battery warranty (km)`, `Mileage`, `Battery capacity`, `CO2 emissions`, `AC charging time`, `DC charging time` and `Seats` are stored as strings (`object`) but should be `int` or `float`. Therefore, we will parse these strings and convert them into numerical data types.

In [11]:
#TODO
column_convert_to_numeric = ['Engine capacity','Power',
                         'Battery warranty (km)','Mileage',
                         'Battery capacity', 'CO2 emissions', 
                         'AC charging time', 'DC charging time']
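def convert_to_int(column):
    # Keep the text before the first space (this drops the unit) and remove
    # non-breaking spaces. Work on a new Series instead of assigning into
    # `column` element by element, which modifies the DataFrame in place.
    cleaned = column.str.split(' ').str[0].str.strip().str.replace('\xa0', '', regex=False)
    # Values that cannot be parsed become NaN.
    return pd.to_numeric(cleaned, errors='coerce')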
    
# Convert specified columns 
df[column_convert_to_numeric] = df[column_convert_to_numeric].apply(convert_to_int)
# Convert 'Seats' column 
df['Seats'] = pd.to_numeric(df['Seats'], errors='coerce')
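
Keeping only the text before the first space works when the number contains no regular spaces. If the raw strings used a regular space as a thousands separator (an assumption; the raw values are not shown here), `'1 598 ccm'` would be reduced to `'1'`. A more defensive variant, sketched with a hypothetical helper `parse_leading_number` and made-up sample values:

```python
import pandas as pd

def parse_leading_number(series: pd.Series) -> pd.Series:
    """Extract the leading number of each string, ignoring the unit."""
    # Digits, optionally grouped by spaces or non-breaking spaces,
    # followed by an optional decimal part.
    number = series.str.extract(r"^\s*(\d[\d \xa0]*(?:\.\d+)?)", expand=False)
    # Remove the grouping characters before parsing.
    number = number.str.replace(" ", "", regex=False)
    number = number.str.replace("\xa0", "", regex=False)
    # Anything that still cannot be parsed becomes NaN.
    return pd.to_numeric(number, errors="coerce")

# Hypothetical raw values; the real crawl format may differ.
sample = pd.Series(["1 598 ccm", "25\xa0000 km", "4.5 h", None])
parsed = parse_leading_number(sample)
print(parsed.tolist())
```

Whichever variant is used, comparing the number of non-null values before and after the conversion shows how many entries were coerced to NaN.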

In [12]:
# Rename the columns
new_column_names = {'Engine capacity': 'Engine capacity(ccm)',
                    'Price': 'Price(EUR)',
                    'Power': 'Power(kW)',
                    'Battery warranty (km)': 'Battery warranty(km)',
                    'Mileage': 'Mileage(km)',
                    'Battery capacity': 'Battery capacity(kWh)',
                    'CO2 emissions': 'CO2 emissions(g/km)',
                    'AC charging time': 'AC charging time(h)',
                    'DC charging time': 'DC charging time(min)'}
df = df.rename(columns=new_column_names)

Convert the columns that store month/year data as strings to datetime:
- First registration
- Warranty until

In [16]:
# Parse "MM/YYYY" strings into datetime values; missing entries become NaT.
# pd.to_datetime is applied column by column, which avoids the deprecated
# DataFrame.applymap.
convert_column = ['First registration', 'Warranty until']
df[convert_column] = df[convert_column].apply(pd.to_datetime, format="%m/%Y")

In [13]:
# Process the Consumption column
...

In [17]:
# After converting data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2293 entries, 0 to 2292
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   CARNAME                2293 non-null   object        
 1   ID                     2293 non-null   int64         
 2   Make                   2293 non-null   object        
 3   Model                  2293 non-null   object        
 4   Body color             2111 non-null   object        
 5   Type of finish         1204 non-null   object        
 6   Interior color         1894 non-null   object        
 7   Interior material      1930 non-null   object        
 8   Body                   2293 non-null   object        
 9   Doors                  2272 non-null   object        
 10  Seats                  2066 non-null   float64       
 11  VIN                    2293 non-null   object        
 12  Fuel                   2293 non-null   object        
 13  Tra

- Now the data types of the columns are more appropriate. The next step is to examine the distribution of data in the numerical and categorical columns.

2. For each numerical column, how are the values distributed?
- What is the percentage of missing values?
- How should the missing values be handled?
- What are the min and max? Are they abnormal?

- Select the columns that contain numerical data and calculate the missing-data rate for each of them.

In [15]:
numeric_df = df.select_dtypes(include=np.number)
missing_percentage = numeric_df.isnull().mean()

print('The percentage of missing values:')
for idx, missing in zip(missing_percentage.index, missing_percentage): 
    print(f'- {idx}: {round(missing * 100, 2)}%')

The percentage of missing values:
- Weight: 71.34%
- Engine capacity(ccm): 10.74%
- Price(EUR): 0.0%
- Power(kW): 0.0%
- Previous owners: 65.97%
- ID: 0.0%
- Battery warranty(km): 95.81%
- Mileage(km): 0.0%
- Battery capacity(kWh): 92.27%
- CO2 emissions(g/km): 0.0%
- AC charging time(h): 95.09%
- DC charging time(min): 97.82%
- Seats: 10.56%


- The columns `Battery warranty(km)`, `Battery capacity(kWh)`, `AC charging time(h)` and `DC charging time(min)` have a high rate of missing data because these attributes only apply to electric cars. Therefore, we keep the data in these columns as it is.
- For the column `Previous owners`, the missing values are replaced with 0, on the assumption that a missing value corresponds to a brand-new car with no previous owner.
- The missing values in the columns `Seats` and `Engine capacity(ccm)` are replaced with the median value.
- The missing values in the column `Weight` are filled with the mean value.

In [16]:
df['Previous owners'] = df['Previous owners'].fillna(0)

# Handle missing values in the Seats column
median_seats = df['Seats'].median()
df['Seats'] = df['Seats'].fillna(median_seats)

# Handle missing values in the Engine capacity column
median_engine_capacity = df['Engine capacity(ccm)'].median()
df['Engine capacity(ccm)'] = df['Engine capacity(ccm)'].fillna(median_engine_capacity)

# Handle missing values in the Weight column
mean_weight = df['Weight'].mean().round(2)
df['Weight'] = df['Weight'].fillna(mean_weight)

- Calculate the percentage of missing values again and save it in the `missing_percentage` variable.

In [17]:
numeric_df = df.select_dtypes(include=np.number)
missing_percentage = numeric_df.isnull().mean()

print('The percentage of missing values:')
for idx, missing in zip(missing_percentage.index, missing_percentage): 
    print(f'- {idx}: {round(missing * 100, 2)}%')

The percentage of missing values:
- Weight: 0.0%
- Engine capacity(ccm): 0.0%
- Price(EUR): 0.0%
- Power(kW): 0.0%
- Previous owners: 0.0%
- ID: 0.0%
- Battery warranty(km): 95.81%
- Mileage(km): 0.0%
- Battery capacity(kWh): 92.27%
- CO2 emissions(g/km): 0.0%
- AC charging time(h): 95.09%
- DC charging time(min): 97.82%
- Seats: 0.0%


- Now let's examine the distribution of data in the numerical columns. We will calculate percentiles with values of 0% (min), 25%, 50%, 75%, and 100% (max) to see how the data is distributed. The results will be saved in the `numeric_col_profile` variable.

In [18]:
numeric_columns = df.select_dtypes(include=np.number).columns
numeric_col_profile = df[numeric_columns].describe().loc[["min", "25%", "50%", "75%", "max"]]
numeric_col_profile 

| | Weight | Engine capacity(ccm) | Price(EUR) | Power(kW) | Previous owners | ID | Battery warranty(km) | Mileage(km) | Battery capacity(kWh) | CO2 emissions(g/km) | AC charging time(h) | DC charging time(min) | Seats |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | 830.0 | 1.0 | 5399.0 | 33.0 | 0.0 | 61021609.0 | 100000.0 | 0.0 | 1.0 | 0.0 | 1.0 | 22.0 | 2.0 |
| 25% | 1597.77 | 1332.0 | 20474.0 | 85.0 | 0.0 | 61022011.5 | 120000.0 | 200.0 | 12.0 | 97.0 | 3.0 | 29.5 | 5.0 |
| 50% | 1597.77 | 1499.0 | 28399.0 | 107.0 | 0.0 | 61022427.0 | 160000.0 | 26788.0 | 16.0 | 117.0 | 4.0 | 32.0 | 5.0 |
| 75% | 1597.77 | 1968.0 | 41749.0 | 145.0 | 1.0 | 61022849.0 | 160000.0 | 62909.5 | 42.0 | 135.0 | 5.0 | 34.5 | 5.0 |
| max | 2676.0 | 6496.0 | 394349.0 | 588.0 | 4.0 | 61023291.0 | 240000.0 | 174215.0 | 319.0 | 356.0 | 12.0 | 55.0 | 7.0 |


- The value ranges of the numerical columns look plausible, except for the minimum of the `Engine capacity(ccm)` column (1 ccm), which is not a realistic engine displacement.
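
In the table above, the 25%, 50% and 75% values of `Weight` are all 1597.77. This is a side effect of filling a mostly missing column (about 71% missing) with its mean: the fill value ends up dominating the quartiles. A minimal sketch with made-up numbers shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical weights (kg) with most values missing, as with `Weight` here.
weights = pd.Series([1200.0, 1800.0] + [np.nan] * 6)

# Fill the gaps with the mean of the observed values (1500.0).
filled = weights.fillna(weights.mean())

quartiles = filled.quantile([0.25, 0.5, 0.75])
print(quartiles.tolist())

# The spread shrinks as well, since most values now sit on the mean.
print(weights.std(), filled.std())
```

For this reason, statistics of `Weight` computed after imputation mostly describe the fill value rather than the observed weights.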

For each categorical column, how are the values distributed?
- What is the percentage of missing values?
- How many distinct values are there? Are any of them abnormal?

In [19]:
#TODO
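
One possible way to answer the questions above, sketched on a hypothetical toy frame rather than on `df` itself. The names `toy`, `categorical` and `profile` are illustrative, and treating every non-numeric column as categorical is an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the categorical columns of `df`.
toy = pd.DataFrame({
    "Fuel": ["Diesel", "Petrol", "Petrol", "Electric"],
    "Body color": ["Black", np.nan, "White", "Black"],
    "Price": [20474, 28399, 41749, 5399],
})

# Assumption: every non-numeric column is treated as categorical.
categorical = toy.select_dtypes(exclude=np.number)

profile = pd.DataFrame({
    "missing_percentage": (categorical.isnull().mean() * 100).round(2),
    "num_unique": categorical.nunique(),
})
print(profile)

# Frequency of each value in one column, missing values included.
print(toy["Body color"].value_counts(dropna=False))
```

Inspecting `value_counts` for each column is usually enough to spot abnormal categories such as misspellings or placeholder strings.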

Is the data reasonable?

In [20]:
#TODO

Outlier detection and treatment.
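
The notebook does not say which method is intended here. As one common option, the interquartile-range (IQR) rule flags values that lie more than 1.5 IQR outside the quartiles. The sketch below applies it to made-up prices; the data and the factor 1.5 are assumptions, not results from the dataset:

```python
import pandas as pd

# Hypothetical prices (EUR); the last value is deliberately extreme.
prices = pd.Series([20474, 24500, 28399, 33000, 41749, 394349])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```

Whether flagged rows are dropped, capped, or kept depends on the goal: an expensive car may be a genuine listing rather than an error.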