# Alonhadat Data Preprocessing
This notebook contains the steps to preprocess the alonhadat.csv dataset.
The steps are derived from the exploration phase and focus on cleaning and transforming the data for modeling.

## 1. Imports
Import necessary libraries for data manipulation and regular expressions.

In [1]:
import pandas as pd
import numpy as np
import re
# from datetime import datetime # Import if date processing is added later
# import warnings
# warnings.filterwarnings('ignore') # Uncomment if needed

## 2. Load Data
Load the raw dataset.

In [2]:
df = pd.read_csv('../Data Collection/Datasets/alonhadat.com/raw/alonhadat.csv')
print("Original DataFrame shape:", df.shape)
df.head()

Original DataFrame shape: (38960, 8)


Unnamed: 0,address,area,bedrooms,date,floors,price,title,url
0,"Đường Nguyễn Văn Cừ, Phường Gia Thụy, Quận Lon...",80 m\n2,1 phòng ngủ,Hôm nay,1 lầu,"7,5 tỷ","🥇ĐẤT NGUYỄN VĂN CỪ 80M, MT8M, MẢNH ĐẤT RỘNG TH...",https://alonhadat.com.vnhttps://alonhadat.com....
1,"Đường Ngọc Lâm, Phường Ngọc Lâm, Quận Long Biê...",36 m\n2,3 phòng ngủ,Hôm nay,6 lầu,"8,65 tỷ","🔥CÒN DUY NHẤT 1 CĂN GIÁ RẺ, NGỌC LÂM 36M, 6T G...",https://alonhadat.com.vnhttps://alonhadat.com....
2,"Đường Ngô Gia Tự, Phường Đức Giang, Quận Long ...",56 m\n2,1 phòng ngủ,Hôm nay,1 lầu,"15,5 tỷ","👉MẶT PHỐ, NGÔ GIA TỰ, 56M, MT4M, VỈA HÈ ĐÁ BÓN...",https://alonhadat.com.vnhttps://alonhadat.com....
3,"Đường Phúc Lợi, Phường Phúc Lợi, Quận Long Biê...",32 m\n2,3 phòng ngủ,Hôm nay,5 lầu,"5,2 tỷ","🥇CĂN DUY NHẤT, NGÕ THÔNG, Ô TÔ , LÔ GÓC, PHÚC ...",https://alonhadat.com.vnhttps://alonhadat.com....
4,"Phố Lệ Mật, Phường Việt Hưng, Quận Long Biên, ...",58 m\n2,3 phòng ngủ,Hôm nay,3 lầu,7 tỷ,"🏡VIỆT HƯNG, DIỆN TÍCH RỘNG 58m, 3T, MT5m GIÁ C...",https://alonhadat.com.vnhttps://alonhadat.com....


## 3. Initial Cleaning and Column Selection
Keep the columns needed for the analysis (dropping `url`) and remove rows that are exact duplicates across the remaining columns.

In [3]:
# Choosing relevant columns
df = df[['address', 'area', 'bedrooms', 'date', 'floors', 'price', 'title']].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
print("Shape after column selection:", df.shape)

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    df = df.drop_duplicates()
    print(f"Duplicates removed. New shape: {df.shape}")
df.head()

Shape after column selection: (38960, 7)
Number of duplicate rows: 15325
Duplicates removed. New shape: (23635, 7)


Unnamed: 0,address,area,bedrooms,date,floors,price,title
0,"Đường Nguyễn Văn Cừ, Phường Gia Thụy, Quận Lon...",80 m\n2,1 phòng ngủ,Hôm nay,1 lầu,"7,5 tỷ","🥇ĐẤT NGUYỄN VĂN CỪ 80M, MT8M, MẢNH ĐẤT RỘNG TH..."
1,"Đường Ngọc Lâm, Phường Ngọc Lâm, Quận Long Biê...",36 m\n2,3 phòng ngủ,Hôm nay,6 lầu,"8,65 tỷ","🔥CÒN DUY NHẤT 1 CĂN GIÁ RẺ, NGỌC LÂM 36M, 6T G..."
2,"Đường Ngô Gia Tự, Phường Đức Giang, Quận Long ...",56 m\n2,1 phòng ngủ,Hôm nay,1 lầu,"15,5 tỷ","👉MẶT PHỐ, NGÔ GIA TỰ, 56M, MT4M, VỈA HÈ ĐÁ BÓN..."
3,"Đường Phúc Lợi, Phường Phúc Lợi, Quận Long Biê...",32 m\n2,3 phòng ngủ,Hôm nay,5 lầu,"5,2 tỷ","🥇CĂN DUY NHẤT, NGÕ THÔNG, Ô TÔ , LÔ GÓC, PHÚC ..."
4,"Phố Lệ Mật, Phường Việt Hưng, Quận Long Biên, ...",58 m\n2,3 phòng ngủ,Hôm nay,3 lầu,7 tỷ,"🏡VIỆT HƯNG, DIỆN TÍCH RỘNG 58m, 3T, MT5m GIÁ C..."


## 4. Address Processing
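Extract structured information (road, ward, district) from the 'address' column and encode each component as an integer category code. A component that could not be extracted stays missing and receives the code -1 from `cat.codes`.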

In [4]:
# Function to extract road, ward, and district from address
def extract_address_components(address):
    road, ward, district = None, None, None
    if isinstance(address, str):
        road_pattern = r'(?:Đường|Phố|Ngõ|Hẻm|Đại lộ|Tỉnh Lộ|Quốc lộ|QL|TL)\s+([^,]+)'
        ward_pattern = r'(?:Phường|Xã|Thị trấn|P\.|X\.|TT\.)\s+([^,]+)'
        district_pattern = r'(?:Quận|Huyện|Thị xã|Thành phố|Q\.|H\.)\s+([^,\.]+)'
        
        road_match = re.search(road_pattern, address, re.IGNORECASE)
        ward_match = re.search(ward_pattern, address, re.IGNORECASE)
        district_match = re.search(district_pattern, address, re.IGNORECASE)
        
        if road_match:
            road = road_match.group(1).strip()
        if ward_match:
            ward = ward_match.group(1).strip()
        if district_match:
            district = district_match.group(1).strip()
            
    return road, ward, district

# Apply the function to create new columns
df[['road', 'ward', 'district']] = df['address'].apply(
    lambda x: pd.Series(extract_address_components(x))
)

# Flag rows where road, ward, and district were all extracted (1) or not (0)
df['address_complete'] = df[['road', 'ward', 'district']].notnull().all(axis=1).astype(int)

# Convert address components to categorical codes
df['road_cat'] = df['road'].astype('category').cat.codes
df['ward_cat'] = df['ward'].astype('category').cat.codes
df['district_cat'] = df['district'].astype('category').cat.codes

print("Address components extracted and categorical codes created.")
df[['address', 'road', 'ward', 'district', 'address_complete', 'road_cat', 'ward_cat', 'district_cat']].head()

Address components extracted and categorical codes created.


Unnamed: 0,address,road,ward,district,address_complete,road_cat,ward_cat,district_cat
0,"Đường Nguyễn Văn Cừ, Phường Gia Thụy, Quận Lon...",Nguyễn Văn Cừ,Gia Thụy,Long Biên,1,564,54,11
1,"Đường Ngọc Lâm, Phường Ngọc Lâm, Quận Long Biê...",Ngọc Lâm,Ngọc Lâm,Long Biên,1,603,157,11
2,"Đường Ngô Gia Tự, Phường Đức Giang, Quận Long ...",Ngô Gia Tự,Đức Giang,Long Biên,1,587,390,11
3,"Đường Phúc Lợi, Phường Phúc Lợi, Quận Long Biê...",Phúc Lợi,Phúc Lợi,Long Biên,1,667,193,11
4,"Phố Lệ Mật, Phường Việt Hưng, Quận Long Biên, ...",Lệ Mật,Việt Hưng,Long Biên,1,445,313,11
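The extraction and encoding steps above can be tried on a tiny example. This is a minimal sketch, not the notebook's function: the patterns are shortened to the most common prefixes, `split_address` is a hypothetical helper name, and the two address strings are made up in the shape of the rows shown above. The second address has no road prefix, so its road stays missing and gets the code -1.

```python
import re
import pandas as pd

# Shortened patterns (assumption: only the most common prefixes are kept)
ROAD = r'(?:Đường|Phố)\s+([^,]+)'
WARD = r'(?:Phường|Xã)\s+([^,]+)'
DISTRICT = r'(?:Quận|Huyện)\s+([^,\.]+)'

def split_address(address):
    """Return (road, ward, district); a part that is not found is None."""
    parts = []
    for pattern in (ROAD, WARD, DISTRICT):
        match = re.search(pattern, address, re.IGNORECASE)
        parts.append(match.group(1).strip() if match else None)
    return tuple(parts)

# Hypothetical addresses; the second one has no road prefix
addresses = pd.Series([
    "Phố Lệ Mật, Phường Việt Hưng, Quận Long Biên",
    "Phường Việt Hưng, Quận Long Biên",
])

parts = pd.DataFrame(
    addresses.map(split_address).tolist(),
    columns=["road", "ward", "district"],
)
# Missing values are not a category, so cat.codes encodes them as -1
parts["road_cat"] = parts["road"].astype("category").cat.codes
print(parts)
```

If the codes are fed to a model, the -1 value will be treated like any other integer, so it may be worth handling missing components explicitly before encoding.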


## 5. Numeric Feature Conversion
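Convert the 'area', 'bedrooms', and 'floors' columns to numeric types by extracting the first number from each string. Values without a parsable number become missing (`NaN` for 'area', `<NA>` for the nullable integer columns).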

In [5]:
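# --- Area ---
# Take the first number in the string; normalise a decimal comma to a dot so to_numeric can parse it
df['area'] = pd.to_numeric(
    df['area'].astype(str).str.extract(r'(\d+(?:[.,]\d+)?)', expand=False).str.replace(',', '.', regex=False),
    errors='coerce'
)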
# --- Bedrooms ---
df['bedrooms'] = pd.to_numeric(df['bedrooms'].astype(str).str.extract(r'(\d+)', expand=False), errors='coerce').astype('Int64')
# --- Floors ---
df['floors'] = pd.to_numeric(df['floors'].astype(str).str.extract(r'(\d+)', expand=False), errors='coerce').astype('Int64')

print("Numeric features converted.")
df[['area', 'bedrooms', 'floors']].info()
df[['area', 'bedrooms', 'floors']].head()

Numeric features converted.
<class 'pandas.core.frame.DataFrame'>
Index: 23635 entries, 0 to 38959
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   area      23635 non-null  float64
 1   bedrooms  20260 non-null  Int64  
 2   floors    20386 non-null  Int64  
dtypes: Int64(2), float64(1)
memory usage: 784.8 KB


Unnamed: 0,area,bedrooms,floors
0,80.0,1,1
1,36.0,3,6
2,56.0,1,1
3,32.0,3,5
4,58.0,3,3
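To see how the extraction behaves on different inputs, here is a small sketch on three strings. Only the first one follows the format shown in the data; the decimal-comma value and the text-only value are hypothetical inputs chosen to exercise the edge cases.

```python
import pandas as pd

# First value mirrors the raw format above; the other two are hypothetical
raw_area = pd.Series(["80 m\n2", "36,5 m\n2", "không rõ"])

# Extract the first number, allowing either '.' or ',' as decimal separator
number_text = raw_area.str.extract(r'(\d+(?:[.,]\d+)?)', expand=False)

# to_numeric does not understand a decimal comma, so normalise it first
area = pd.to_numeric(number_text.str.replace(',', '.', regex=False), errors='coerce')
print(area)
```

Without the comma normalisation, `"36,5"` would be coerced to `NaN` instead of 36.5. The sketch does not address strings where `.` is used as a thousands separator; those would need a separate rule.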


## 6. Price Processing
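Parse the 'price' column, converting string representations (e.g., "X tỷ", "Y triệu") into a standardized numeric 'price_converted' column expressed in millions of VND. Prices listed as "thỏa thuận" (negotiable) become missing, and a number without a unit is treated as raw VND.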

In [6]:
def parse_price(price_str):
    if pd.isna(price_str):
        return np.nan
    
    price_str_lower = str(price_str).lower()
    
    if 'thỏa thuận' in price_str_lower:
        return np.nan

    cleaned_price_str = price_str_lower.replace(',', '.')
    
    num_part_match = re.search(r'(\d+(?:\.\d+)?)', cleaned_price_str)
    if not num_part_match:
        return np.nan
        
    num_val = float(num_part_match.group(1))
    
    if 'tỷ' in price_str_lower:
        return num_val * 1000  # Convert billions to millions
    elif 'triệu' in price_str_lower:
        return num_val
    return num_val / 1e6  # No unit found: assume raw VND and convert to millions

df['price_converted'] = df['price'].apply(parse_price)

print("Price column parsed and 'price_converted' (in millions) created.")
df[['price', 'price_converted']].head()
df['price_converted'].describe()

Price column parsed and 'price_converted' (in millions) created.


count    2.361700e+04
mean     3.118170e+04
std      6.854573e+04
min      4.000000e-06
25%      7.000000e+03
50%      1.280000e+04
75%      2.550000e+04
max      1.050000e+06
Name: price_converted, dtype: float64
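The unit handling can be illustrated with a table-driven variant. This is a sketch under stated assumptions rather than the function used above: `to_millions` and `UNIT_TO_MILLIONS` are hypothetical names, and the example strings other than "7,5 tỷ" are made up. The last example shows how a number without a unit ends up as a near-zero value, which would be consistent with the very small minimum in the summary above.

```python
import math
import re

# Multipliers that convert each unit to millions of VND
UNIT_TO_MILLIONS = {"tỷ": 1000.0, "triệu": 1.0}

def to_millions(text):
    """Convert a price string to millions of VND; NaN if it cannot be parsed."""
    text = text.lower().replace(",", ".")
    match = re.search(r"(\d+(?:\.\d+)?)", text)
    if match is None or "thỏa thuận" in text:
        return math.nan
    value = float(match.group(1))
    for unit, factor in UNIT_TO_MILLIONS.items():
        if unit in text:
            return value * factor
    # No unit: treated as raw VND, mirroring the fallback in parse_price
    return value / 1e6

examples = ["7,5 tỷ", "850 triệu", "Thỏa thuận", "4"]
converted = [to_millions(s) for s in examples]
print(dict(zip(examples, converted)))
```

A listing priced as "4" most likely means something other than 4 VND, so values this small may deserve a manual check or an explicit minimum-price filter.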

## 7. Outlier Removal (Initial Columns)
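Define and apply the IQR method to remove outliers from 'area', 'bedrooms', 'floors', and 'price_converted'. The filters run one after another, each on the already filtered frame. Because comparisons with a missing value evaluate to False, rows with a missing value in any of these columns are dropped as well (for example, 'bedrooms' had 3,375 missing values before this step).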

In [7]:
def remove_outliers_iqr(df_in, column):
    Q1 = df_in[column].quantile(0.25)
    Q3 = df_in[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df_in[(df_in[column] >= lower_bound) & (df_in[column] <= upper_bound)]

outlier_columns = ['area', 'bedrooms', 'floors', 'price_converted']
print(f"Shape before outlier removal: {df.shape}")

for col in outlier_columns:
    if col in df.columns and df[col].notna().sum() > 0:
        df = remove_outliers_iqr(df, col)
    else:
        print(f"Skipping outlier removal for column '{col}' as it's missing or all NA.")

print(f"Shape after outlier removal from {outlier_columns}: {df.shape}")
df[outlier_columns].describe()

Shape before outlier removal: (23635, 15)
Shape after outlier removal from ['area', 'bedrooms', 'floors', 'price_converted']: (14905, 15)


Unnamed: 0,area,bedrooms,floors,price_converted
count,14905.0,14905.0,14905.0,14905.0
mean,56.899555,4.232003,4.669373,13538.931091
std,25.671906,1.678575,1.493066,8628.593353
min,1.106,1.0,1.0,3.8e-05
25%,40.0,3.0,4.0,7000.0
50%,50.0,4.0,5.0,10800.0
75%,68.0,5.0,5.0,18000.0
max,176.0,10.0,9.0,41500.0
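The effect of missing values on the IQR filter can be checked on a toy column. The data below is invented for illustration, and `iqr_bounds` is a hypothetical helper; the second filter is one possible variant that keeps missing rows, not the approach used in this notebook.

```python
import pandas as pd

def iqr_bounds(series):
    """Return the (lower, upper) fences at 1.5 * IQR; NaN values are ignored."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy data: one extreme value (30) and one missing value
toy = pd.DataFrame({"bedrooms": [2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 30.0, float("nan")]})
lower, upper = iqr_bounds(toy["bedrooms"])

# Same comparison as remove_outliers_iqr: NaN fails both tests and is dropped
strict = toy[(toy["bedrooms"] >= lower) & (toy["bedrooms"] <= upper)]

# Variant that removes only true outliers and keeps rows with missing values
keep_missing = toy[toy["bedrooms"].between(lower, upper) | toy["bedrooms"].isna()]

print(len(toy), len(strict), len(keep_missing))  # 8 6 7
```

Which variant is appropriate depends on whether rows with missing 'bedrooms' or 'floors' should be imputed later or excluded from modeling altogether.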


## 8. Feature Engineering: Price per m²
Create the 'price_per_m2' feature by dividing the cleaned 'price_converted' by 'area'. Since 'price_converted' is in millions of VND, 'price_per_m2' is in millions of VND per m².

In [8]:
# Keep only rows with a positive, non-missing area before dividing
df = df[df['area'].notna() & (df['area'] > 0)].copy()  # .copy() avoids SettingWithCopyWarning
df['price_per_m2'] = df['price_converted'] / df['area']

print("'price_per_m2' column created.")
df[['area', 'price_converted', 'price_per_m2']].head()
df['price_per_m2'].describe()

'price_per_m2' column created.


count    1.490500e+04
mean     2.457164e+02
std      3.038716e+02
min      2.171429e-07
25%      1.690909e+02
50%      2.187500e+02
75%      2.948052e+02
max      2.513562e+04
Name: price_per_m2, dtype: float64
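As a quick check of the units, the calculation can be worked by hand for the first three listings shown earlier. The prices are their 'tỷ' values converted to millions of VND (7,5 tỷ, 8,65 tỷ, and 15,5 tỷ), paired with the areas from the same rows.

```python
# (price in millions of VND, area in m²) for the first three rows shown earlier
listings = [(7500.0, 80.0), (8650.0, 36.0), (15500.0, 56.0)]

# Price per m² in millions of VND
price_per_m2 = [price / area for price, area in listings]

for (price, area), ppm in zip(listings, price_per_m2):
    print(f"{price:>8.0f} million VND / {area:.0f} m² = {ppm:.2f} million VND per m²")
```

The first listing works out to 93.75 million VND per m², which is on the same scale as the quartiles reported above.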

## 9. Outlier Removal for Price per m²
Apply IQR outlier removal specifically to the newly created 'price_per_m2' column.

In [9]:
print(f"Shape before outlier removal for 'price_per_m2': {df.shape}")
if 'price_per_m2' in df.columns and df['price_per_m2'].notna().sum() > 0:
    df = remove_outliers_iqr(df, 'price_per_m2')
    print(f"Shape after outlier removal for 'price_per_m2': {df.shape}")
    print("\nSummary statistics for 'price_per_m2' after outlier removal:")
    print(df['price_per_m2'].describe())
else:
    print("Skipping outlier removal for 'price_per_m2' as it's missing or all NA.")

Shape before outlier removal for 'price_per_m2': (14905, 16)
Shape after outlier removal for 'price_per_m2': (14398, 16)

Summary statistics for 'price_per_m2' after outlier removal:
count    1.439800e+04
mean     2.278564e+02
std      9.321443e+01
min      2.171429e-07
25%      1.666667e+02
50%      2.156250e+02
75%      2.847882e+02
max      4.833333e+02
Name: price_per_m2, dtype: float64
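The near-zero minimum survives this step, and the arithmetic explains why. The sketch below plugs the quartiles reported for 'price_per_m2' before removal into the same fence formula; the values are copied from the summary output, rounded to four decimals.

```python
# Quartiles of price_per_m2 before outlier removal (from the summary above)
q1, q3 = 169.0909, 294.8052

iqr = q3 - q1
lower = q1 - 1.5 * iqr  # lower fence
upper = q3 + 1.5 * iqr  # upper fence

print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")
```

The lower fence is negative, so no positive value can be removed as a low outlier, while the upper fence of roughly 483.4 agrees with the maximum of 483.33 after removal. If implausibly small prices need to be excluded, an explicit minimum threshold would be one way to do it; the IQR rule alone will not catch them here.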


## 10. Final Processed DataFrame
Display information about the final processed DataFrame.

In [10]:
print("Final DataFrame head:")
print(df.head())
print("\nFinal DataFrame info:")
df.info()
print("\nFinal DataFrame shape:", df.shape)

# Optional: Save the processed DataFrame
# df.to_csv('../Data Preprocessing/alonhadat_processed.csv', index=False)
# print("\nProcessed DataFrame saved to '../Data Preprocessing/alonhadat_processed.csv'")

Final DataFrame head:
                                             address  area  bedrooms     date  \
0  Đường Nguyễn Văn Cừ, Phường Gia Thụy, Quận Lon...  80.0         1  Hôm nay   
1  Đường Ngọc Lâm, Phường Ngọc Lâm, Quận Long Biê...  36.0         3  Hôm nay   
2  Đường Ngô Gia Tự, Phường Đức Giang, Quận Long ...  56.0         1  Hôm nay   
3  Đường Phúc Lợi, Phường Phúc Lợi, Quận Long Biê...  32.0         3  Hôm nay   
4  Phố Lệ Mật, Phường Việt Hưng, Quận Long Biên, ...  58.0         3  Hôm nay   

   floors     price                                              title  \
0       1    7,5 tỷ  🥇ĐẤT NGUYỄN VĂN CỪ 80M, MT8M, MẢNH ĐẤT RỘNG TH...   
1       6   8,65 tỷ  🔥CÒN DUY NHẤT 1 CĂN GIÁ RẺ, NGỌC LÂM 36M, 6T G...   
2       1   15,5 tỷ  👉MẶT PHỐ, NGÔ GIA TỰ, 56M, MT4M, VỈA HÈ ĐÁ BÓN...   
3       5    5,2 tỷ  🥇CĂN DUY NHẤT, NGÕ THÔNG, Ô TÔ , LÔ GÓC, PHÚC ...   
4       3      7 tỷ  🏡VIỆT HƯNG, DIỆN TÍCH RỘNG 58m, 3T, MT5m GIÁ C...   

            road       ward   district  addres

In [11]:
# Save the processed DataFrame
df.to_csv('../Data Preprocessing/alonhadat_processed.csv', index=False)