# Real estate valuation in Danang

## Problem statement: Real estate valuation

### Target
Build a model to predict real estate prices (houses and land) based on related characteristics such as location, area, legality, house direction, amenities, etc. This system helps buyers, sellers and investors accurately value real estate assets, supporting reasonable buying or investing decisions.

### Input
The input data covers factors that affect real estate prices:

- Location:
  - Ward Name
  - Street Name
  - Longitude, Latitude
- Area & size information:
  - Area (m²)
  - Width (m)
  - Length (m)
- Real estate type & legality:
  - Land Type
  - Legal Status
- Property features:
  - House Direction
  - Property Features
  - Floors
  - Rooms
  - Area of use
  - Toilets
  - House type
  - Interior status

### Output
A real estate price prediction model that takes the input factors above and returns an estimated price for a given property.

### Category of problem
This is a regression problem, in which a machine learning model will predict a numerical value (real estate price) based on many related input variables.
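To make the regression framing concrete, here is a minimal sketch that fits a one-variable least-squares line (price against area) in plain Python. The numbers are toy values chosen for illustration only, not listings from the dataset, and the helper name `fit_simple_linear_regression` is hypothetical. A real model for this problem would use many more input variables.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (illustrative only): area in m² and price in million VND
areas = [50, 80, 100, 120]
prices = [1500, 2400, 3000, 3600]

slope, intercept = fit_simple_linear_regression(areas, prices)

# Predict the price of a hypothetical 90 m² property
predicted_price = slope * 90 + intercept
```

The output of such a model is a continuous number, which is what distinguishes regression from classification.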

### Practical applications
- Buyers & sellers: Determine a fair price when trading real estate.
- Banks & financial institutions: Appraise the value of mortgaged assets.

## Data Collection [NhaTot](https://www.nhatot.com/mua-ban-dat-da-nang):

In [1]:
import requests
import pandas as pd
from datetime import datetime
import os

#### Define the API URLs

In [2]:
land_base_url = "https://gateway.chotot.com/v1/public/ad-listing?region_v2=3017&cg=1040&sp=1&st=s,k&limit=20&w=1&include_expired_ads=true&key_param_included=true"
house_base_url = "https://gateway.chotot.com/v1/public/ad-listing?region_v2=3017&cg=1020&sp=1&st=s,k&limit=20&w=1&include_expired_ads=true&key_param_included=true"
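The two URLs differ only in the `cg` value (1040 for land, 1020 for houses). As a sketch, the query string could be assembled from a dictionary with the standard library, so the shared parameters are written once. The names `API_ENDPOINT` and `build_base_url` are hypothetical, and the parameter values are copied verbatim from the URLs above without assuming anything about their meaning.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://gateway.chotot.com/v1/public/ad-listing"

def build_base_url(category_code):
    """Build the listing URL for one category code (the `cg` parameter)."""
    params = {
        "region_v2": 3017,
        "cg": category_code,
        "sp": 1,
        "st": "s,k",
        "limit": 20,
        "w": 1,
        "include_expired_ads": "true",
        "key_param_included": "true",
    }
    # safe="," keeps the comma in "s,k" unescaped, matching the original URLs
    return f"{API_ENDPOINT}?{urlencode(params, safe=',')}"

land_url = build_base_url(1040)
house_url = build_base_url(1020)
```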

#### Collect primary data about land (using API crawling method)

In [3]:
all_land_data = []
page = 1  # Start from page 1
offset = 0  # Start with offset = 0

while True:
    # Construct the URL with the current page and offset
    url = f"{land_base_url}&page={page}&o={offset}"

    # Send GET request to fetch data
    response = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection

    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()  # Convert response to JSON
        ads = data.get("ads", [])  # Get the list of ads
        
        # If no more ads are found, stop the loop
        if not ads:
            print(f"No more data at page {page}, stopping...")
            break
        
        # Extract fields        
        for ad in ads:
            all_land_data.append({
                "ID": ad.get("ad_id"),
                "List ID": ad.get("list_id"),
                "Posted Time": ad.get("list_time"),
                "Status": ad.get("state"),
                "Transaction Type": ad.get("type"),
                "Region (Code)": ad.get("region"),
                "Region Name": ad.get("region_name"),
                "Category": ad.get("category"),
                "Category Name": ad.get("category_name"),
                "Company Ad": ad.get("company_ad"),
                "Title": ad.get("subject"),
                "Description": ad.get("body"),
                "Price": ad.get("price"),
                "Price per m²": ad.get("price_million_per_m2"),
                "City/Province": ad.get("region_name"),
                "Area": ad.get("size"),
                "Width": ad.get("width"),
                "Length": ad.get("length"),
                "Street Name": ad.get("street_name"),
                "Legal Status": ad.get("property_legal_document"),
                "House Direction": ad.get("direction"),
                "Longitude": ad.get("longitude"),
                "Latitude": ad.get("latitude"),
                "Location": ad.get("location"),
                "Land Type": ad.get("land_type"),
                "Sold Ads Count": ad.get("sold_ads"),
                "Active Ads Count": ad.get("seller_info", {}).get("live_ads"),
                "Seller Avatar": ad.get("avatar"),
                "Post Time": ad.get("date"),
                "Ward Name": ad.get("ward_name"),
                "District Name": ad.get("area_name"),
                "Property Features": "; ".join(map(str, ad.get("pty_characteristics", []))),
                "Additional Features": "; ".join(map(str, ad.get("ad_features", []))),
            })

        print(f"Fetched {len(ads)} ads from page {page}, offset {offset}.")

        # Increase for the next loop
        page += 1
        offset += 20  # Offset increases by 20 each time
    else:
        print(f"Error fetching data from page {page}, offset {offset}! Status code: {response.status_code}")
        break  # Stop in case of an error

# Convert data to DataFrame
df_land = pd.DataFrame(all_land_data)

Fetched 20 ads from page 1, offset 0.
Fetched 20 ads from page 2, offset 20.
Fetched 20 ads from page 3, offset 40.
Fetched 20 ads from page 4, offset 60.
Fetched 20 ads from page 5, offset 80.
Fetched 20 ads from page 6, offset 100.
Fetched 20 ads from page 7, offset 120.
Fetched 20 ads from page 8, offset 140.
Fetched 20 ads from page 9, offset 160.
Fetched 20 ads from page 10, offset 180.
Fetched 20 ads from page 11, offset 200.
Fetched 20 ads from page 12, offset 220.
Fetched 20 ads from page 13, offset 240.
Fetched 20 ads from page 14, offset 260.
Fetched 20 ads from page 15, offset 280.
Fetched 20 ads from page 16, offset 300.
Fetched 20 ads from page 17, offset 320.
Fetched 20 ads from page 18, offset 340.
Fetched 20 ads from page 19, offset 360.
Fetched 20 ads from page 20, offset 380.
Fetched 20 ads from page 21, offset 400.
Fetched 20 ads from page 22, offset 420.
Fetched 20 ads from page 23, offset 440.
Fetched 20 ads from page 24, offset 460.
Fetched 20 ads from page 25, of

#### Collect primary data about houses (using API crawling method)

In [4]:
all_house_data = []
page = 1 
offset = 0 

while True:
    url = f"{house_base_url}&page={page}&o={offset}"

    response = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection

    if response.status_code == 200:
        data = response.json()  
        ads = data.get("ads", []) 
        
        if not ads:
            print(f"No more data at page {page}, stopping...")
            break
        
        for ad in ads:
            all_house_data.append({
                "Ad ID": ad.get("ad_id"),
                "List ID": ad.get("list_id"),
                "Posted Time": ad.get("list_time"),
                "State": ad.get("state"),
                "Transaction Type": ad.get("type"),
                "Category": ad.get("category"),
                "Category Name": ad.get("category_name"),
                "Company Ad": ad.get("company_ad"),
                "Title": ad.get("subject"),
                "Description": ad.get("body"),
                "Price": ad.get("price"),
                "Price per m²": ad.get("price_million_per_m2"),
                "City/Province": ad.get("region_name"),
                "Area": ad.get("size"),
                "Width": ad.get("width"),
                "Length": ad.get("length"),
                "Street Name": ad.get("street_name"),
                "Legal Status": ad.get("property_legal_document"),
                "House Direction": ad.get("direction"),
                "Longitude": ad.get("longitude"),
                "Latitude": ad.get("latitude"),
                "Location": ad.get("location"),
                "Floors": ad.get("floors"),
                "Rooms": ad.get("rooms"),
                "Toilets": ad.get("toilets"),
                "Furnishing Sell": ad.get("furnishing_sell"),
                "Sold Ads Count": ad.get("sold_ads"),
                "Active Ads Count": ad.get("seller_info", {}).get("live_ads"),
                "Seller Avatar": ad.get("seller_info", {}).get("avatar"),
                "Post Time": ad.get("date"),
                "Ward Name": ad.get("ward_name"),
                "District Name": ad.get("area_name")
            })
            
        print(f"Fetched {len(ads)} ads from page {page}, offset {offset}.")
        
        page += 1
        offset += 20
    else:
        print(f"Request failed with status code {response.status_code}")
        break

df_house = pd.DataFrame(all_house_data)

Fetched 20 ads from page 1, offset 0.
Fetched 20 ads from page 2, offset 20.
Fetched 20 ads from page 3, offset 40.
Fetched 20 ads from page 4, offset 60.
Fetched 20 ads from page 5, offset 80.
Fetched 20 ads from page 6, offset 100.
Fetched 20 ads from page 7, offset 120.
Fetched 20 ads from page 8, offset 140.
Fetched 20 ads from page 9, offset 160.
Fetched 20 ads from page 10, offset 180.
Fetched 20 ads from page 11, offset 200.
Fetched 20 ads from page 12, offset 220.
Fetched 20 ads from page 13, offset 240.
Fetched 20 ads from page 14, offset 260.
Fetched 20 ads from page 15, offset 280.
Fetched 20 ads from page 16, offset 300.
Fetched 20 ads from page 17, offset 320.
Fetched 20 ads from page 18, offset 340.
Fetched 20 ads from page 19, offset 360.
Fetched 20 ads from page 20, offset 380.
Fetched 20 ads from page 21, offset 400.
Fetched 20 ads from page 22, offset 420.
Fetched 20 ads from page 23, offset 440.
Fetched 20 ads from page 24, offset 460.
Fetched 20 ads from page 25, of
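The land and house loops share the same pagination logic: request a page, stop when the `ads` list is empty, otherwise advance `page` by 1 and `offset` by 20. A possible refactoring, sketched below, moves that logic into one helper that receives the page-fetching step as a callable. The names `fetch_all_ads` and `fetch_page`, and the optional `max_pages` limit, are assumptions introduced for this illustration. The demo uses an in-memory fake instead of the real API so it runs offline.

```python
def fetch_all_ads(fetch_page, page_size=20, max_pages=None):
    """Collect ads page by page until a page comes back empty.

    fetch_page(page, offset) must return a list of ad dicts.
    max_pages optionally caps the number of requests.
    """
    all_ads = []
    page, offset = 1, 0
    while max_pages is None or page <= max_pages:
        ads = fetch_page(page, offset)
        if not ads:  # Empty page: no more data
            break
        all_ads.extend(ads)
        page += 1
        offset += page_size
    return all_ads

# Offline demo: a fake fetcher keyed by offset, with a page size of 2
fake_pages = {0: [{"ad_id": 1}, {"ad_id": 2}], 2: [{"ad_id": 3}]}

def fake_fetch(page, offset):
    return fake_pages.get(offset, [])

ads = fetch_all_ads(fake_fetch, page_size=2)
first_page_only = fetch_all_ads(fake_fetch, page_size=2, max_pages=1)
```

In the notebook, `fetch_page` could wrap `requests.get(f"{base_url}&page={page}&o={offset}", timeout=30)` and return `response.json().get("ads", [])`, leaving only the field extraction specific to each category.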

#### Preliminary feature transformation

##### Convert the posted time to DD/MM/YYYY format

In [5]:
# Convert a timestamp in milliseconds to a DD/MM/YYYY string
def convert_timestamp(timestamp):
    if pd.isna(timestamp):  # Missing values stay missing
        return None
    try:
        return datetime.fromtimestamp(int(timestamp) / 1000).strftime("%d/%m/%Y")
    except (TypeError, ValueError, OverflowError, OSError):
        return None  # Invalid or out-of-range timestamp

# Apply conversion to the "Posted Time" column
df_land["Posted Time"] = df_land["Posted Time"].apply(convert_timestamp)
df_house["Posted Time"] = df_house["Posted Time"].apply(convert_timestamp)
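One caveat: `datetime.fromtimestamp` without a `tz` argument uses the local time zone of the machine running the notebook, so the same timestamp can map to different dates on different machines. The sketch below pins the conversion to UTC+7 (Vietnam's offset, which has no daylight saving time). The function name `convert_timestamp_tz` is hypothetical, and treating the timestamps as Vietnam local time is an assumption about the data.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+7 offset for Vietnam (assumed to be the intended time zone)
VIETNAM_TZ = timezone(timedelta(hours=7))

def convert_timestamp_tz(timestamp_ms, tz=VIETNAM_TZ):
    """Convert a millisecond timestamp to DD/MM/YYYY in the given time zone."""
    try:
        moment = datetime.fromtimestamp(int(timestamp_ms) / 1000, tz=tz)
    except (TypeError, ValueError, OverflowError, OSError):
        return None  # Missing, invalid, or out-of-range timestamp
    return moment.strftime("%d/%m/%Y")

# The same instant falls on different calendar days in UTC+7 and in UTC
date_in_vietnam = convert_timestamp_tz(1745254800000)
date_in_utc = convert_timestamp_tz(1745254800000, tz=timezone.utc)
```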

#### Export to CSV files

In [6]:
today_str = datetime.today().strftime("%Y-%m-%d")

In [7]:
csv_file = f"RawHouseData_{today_str}.csv"
df_house.to_csv(csv_file, index=False, encoding="utf-8-sig")

print(f"Data has been processed and saved to '{csv_file}' with {len(df_house)} ads.")

Data has been processed and saved to 'RawHouseData_2025-04-22.csv' with 7447 ads.


In [8]:
csv_file = f"RawLandData_{today_str}.csv"
df_land.to_csv(csv_file, index=False, encoding="utf-8-sig")

print(f"Data has been processed and saved to '{csv_file}' with {len(df_land)} ads.")

Data has been processed and saved to 'RawLandData_2025-04-22.csv' with 6887 ads.

#### Combine land and house data


In [9]:
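# Columns shared by both tables, and columns that exist in only one of them
common_columns = list(set(df_land.columns) & set(df_house.columns))
land_only_columns = [col for col in df_land.columns if col not in common_columns]
house_only_columns = [col for col in df_house.columns if col not in common_columns]

# Add house-only columns to the land table, filled with 0 as a placeholder
for col in house_only_columns:
    df_land[col] = 0

# Add land-only columns to the house table, filled with 0 as a placeholder
for col in land_only_columns:
    df_house[col] = 0

# Stack both tables into one DataFrame with a fresh index
df_combined = pd.concat([df_land, df_house], ignore_index=True)

print(df_combined.head())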

          ID    List ID Posted Time    Status Transaction Type  Region (Code)  \
0  166182123  123844119  26/03/2025  accepted                s              3   
1  166732350  124317035  15/04/2025  accepted                s              3   
2  166852837  124421115  19/04/2025  accepted                s              3   
3  166702487  124291289  14/04/2025  accepted                s              3   
4  165905933  123605746  20/04/2025  accepted                s              3   

  Region Name  Category Category Name Company Ad  ...     Ward Name  \
0     Đà Nẵng      1040           Đất       None  ...  Xã Hòa Phong   
1     Đà Nẵng      1040           Đất       None  ...    Xã Hòa Sơn   
2     Đà Nẵng      1040           Đất       None  ...   Xã Hòa Ninh   
3     Đà Nẵng      1040           Đất       True  ...    Xã Hòa Bắc   
4     Đà Nẵng      1040           Đất       None  ...    Xã Hòa Sơn   

    District Name  Property Features  Additional Features Ad ID  State  \
0  Huyện Hòa
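Because the land table uses `ID` and `Status` while the house table uses `Ad ID` and `State` for the same API fields (`ad_id` and `state`), the combined table ends up with both pairs of columns. An alternative, sketched here on toy frames rather than the real data, is to rename the house columns first and let `pd.concat` align columns by name. With that approach, columns missing from one table are filled with `NaN` instead of 0, which keeps "not applicable" distinct from a real zero.

```python
import pandas as pd

# Toy frames (illustrative only) mimicking the two schemas
land = pd.DataFrame({"ID": [1, 2], "Status": ["accepted", "accepted"], "Land Type": [1, 2]})
house = pd.DataFrame({"Ad ID": [3], "State": ["accepted"], "Floors": [2]})

# Give the same API field the same column name in both tables
house_aligned = house.rename(columns={"Ad ID": "ID", "State": "Status"})

# concat aligns on column names; values missing from one table become NaN
combined = pd.concat([land, house_aligned], ignore_index=True)
```

Whether `NaN` or 0 is the better placeholder depends on how missing values are handled later in the pipeline.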

In [10]:
csv_file = "RawData.csv"

# Load previously collected data, if any
if os.path.exists(csv_file):
    # low_memory=False reads the file in one pass and avoids the mixed-dtype warning
    old_df = pd.read_csv(csv_file, encoding="utf-8-sig", low_memory=False)
else:
    old_df = pd.DataFrame()

if not old_df.empty:
    # Append the new crawl and drop ads that were already collected
    merged_df = pd.concat([old_df, df_combined], ignore_index=True)
    merged_df = merged_df.drop_duplicates(subset=["Posted Time", "Latitude", "Longitude", "Category"], keep="first")
else:
    merged_df = df_combined

merged_df.to_csv(csv_file, index=False, encoding="utf-8-sig")

print(f"Data has been processed and saved to '{csv_file}' with {len(merged_df)} unique ads.")

Data has been processed and saved to 'RawData.csv' with 27208 unique ads.
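The deduplication key above (`Posted Time`, `Latitude`, `Longitude`, `Category`) can merge two distinct ads posted on the same day at the same coordinates. A possible alternative is to deduplicate on `List ID`, assuming that `list_id` uniquely identifies a listing (the source does not confirm this). The sketch below uses toy frames and `keep="last"` so that a re-crawled ad replaces its older copy.

```python
import pandas as pd

# Toy frames (illustrative only): listing 102 appears in both crawls with a new price
old = pd.DataFrame({"List ID": [101, 102], "Price": [1000, 2000]})
new = pd.DataFrame({"List ID": [102, 103], "Price": [2100, 3000]})

merged = (
    pd.concat([old, new], ignore_index=True)
    .drop_duplicates(subset=["List ID"], keep="last")  # Keep the most recent copy
    .reset_index(drop=True)
)
```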


#### Dataset overview

In [11]:
df_land.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6887 entries, 0 to 6886
Data columns (total 39 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   6887 non-null   int64  
 1   List ID              6887 non-null   int64  
 2   Posted Time          6887 non-null   object 
 3   Status               6887 non-null   object 
 4   Transaction Type     6887 non-null   object 
 5   Region (Code)        6887 non-null   int64  
 6   Region Name          6887 non-null   object 
 7   Category             6887 non-null   int64  
 8   Category Name        6887 non-null   object 
 9   Company Ad           6116 non-null   object 
 10  Title                6887 non-null   object 
 11  Description          6887 non-null   object 
 12  Price                6887 non-null   int64  
 13  Price per m²         6887 non-null   float64
 14  City/Province        6887 non-null   object 
 15  Area                 6887 non-null   f

In [12]:
df_house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7447 entries, 0 to 7446
Data columns (total 39 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Ad ID                7447 non-null   int64  
 1   List ID              7447 non-null   int64  
 2   Posted Time          7447 non-null   object 
 3   State                7447 non-null   object 
 4   Transaction Type     7447 non-null   object 
 5   Category             7447 non-null   int64  
 6   Category Name        7447 non-null   object 
 7   Company Ad           6496 non-null   object 
 8   Title                7447 non-null   object 
 9   Description          7447 non-null   object 
 10  Price                7447 non-null   int64  
 11  Price per m²         7446 non-null   float64
 12  City/Province        7447 non-null   object 
 13  Area                 7446 non-null   float64
 14  Width                5213 non-null   float64
 15  Length               4051 non-null   f
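
The non-null counts are easier to compare as missing-value percentages. The sketch below does that arithmetic for the house columns visible in the output above, using only the counts printed there. The variable names are hypothetical.

```python
# Counts copied from the df_house.info() output above
total_rows = 7447
non_null_counts = {
    "Company Ad": 6496,
    "Price per m²": 7446,
    "Area": 7446,
    "Width": 5213,
    "Length": 4051,
}

# Percentage of missing values per column, rounded to one decimal place
missing_percent = {
    column: round(100 * (total_rows - count) / total_rows, 1)
    for column, count in non_null_counts.items()
}

# Columns sorted from most to least incomplete
most_incomplete = sorted(missing_percent, key=missing_percent.get, reverse=True)
```

On the DataFrame itself, `df_house.isna().mean() * 100` gives the same percentages for every column at once.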