# Real estate valuation in Danang

## Problem statement: Real estate valuation

### Target
Build a model to predict real estate prices (houses and land) based on related characteristics such as location, area, legality, house direction, amenities, etc. This system helps buyers, sellers and investors accurately value real estate assets, supporting reasonable buying or investing decisions.

### Input
Input data includes the factors that affect real estate prices:
- Location:
  - Ward Name
  - Street Name
  - Longitude, Latitude
- Area & size information:
  - Area (m²)
  - Width (m)
  - Length (m)
- Real estate type & legality:
  - Land Type
  - Legal Status
- Property features:
  - House Direction
  - Property Features
  - Floors
  - Rooms
  - Area of use
  - Toilets
  - House type
  - Interior status

### Output
A real estate price prediction model that takes the input factors above and returns an estimated price for a real estate asset.

### Category of problem
This is a regression problem, in which a machine learning model will predict a numerical value (real estate price) based on many related input variables.
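
To make the regression framing concrete, here is a minimal sketch that fits a straight line to a single feature (area) by ordinary least squares. The numbers are hypothetical and chosen only for illustration, and a one-feature line is a stand-in, not the model this notebook builds.

```python
# Hypothetical data: land area in m² and price in million VND (illustrative only).
areas = [50.0, 80.0, 100.0, 120.0]
prices = [2000.0, 2900.0, 3500.0, 4100.0]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(areas, prices)

# Predict a numerical value (the price) for an unseen input (90 m²).
predicted = slope * 90.0 + intercept
print(f"Estimated price for 90 m²: {predicted:.0f} million VND")
```

The real problem has many inputs, several of them categorical (ward, legal status, direction), so a practical model would need feature encoding and more than one coefficient.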

### Practical applications
- Buyers & sellers: Determine a fair price when trading real estate.
- Banks & financial institutions: Appraise the value of mortgaged assets.

## Data collection from [NhaTot](https://www.nhatot.com/mua-ban-dat-da-nang)

In [1]:
import requests
import pandas as pd
from datetime import datetime
import os

#### Set the base URL

In [2]:
base_url = "https://gateway.chotot.com/v1/public/ad-listing?region_v2=3017&cg=1040&sp=1&st=s,k&limit=20&w=1&include_expired_ads=true&key_param_included=true"
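
The query string in `base_url` is easier to inspect once it is parsed. The sketch below uses only the standard library to list the parameters and to build a paged URL in the same `page`/`o` scheme the crawl loop uses. The helper name `page_url` is hypothetical, and the meaning of each parameter is not documented here, so the code reads the values without interpreting them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://gateway.chotot.com/v1/public/ad-listing?region_v2=3017&cg=1040&sp=1&st=s,k&limit=20&w=1&include_expired_ads=true&key_param_included=true"

parts = urlsplit(base_url)
params = parse_qs(parts.query)  # maps each parameter name to a list of string values

# The page size is taken from the URL rather than hard-coded a second time.
page_size = int(params["limit"][0])


def page_url(page, page_size=page_size):
    """Build the URL for a 1-based page number (hypothetical helper)."""
    offset = (page - 1) * page_size
    return f"{base_url}&{urlencode({'page': page, 'o': offset})}"


print(sorted(params))
print(page_url(3))
```

Deriving the offset from `limit` keeps the two values in step if the page size is ever changed in the URL.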

#### Collect primary data (by crawling the API)

In [3]:
all_data = []
page = 1  # Start from page 1
offset = 0  # Start with offset = 0

while True:
    # Construct the URL with the current page and offset
    url = f"{base_url}&page={page}&o={offset}"

    # Send GET request to fetch data (the 30 s timeout is an assumed value; without one the call can hang indefinitely)
    response = requests.get(url, timeout=30)

    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()  # Convert response to JSON
        ads = data.get("ads", [])  # Get the list of ads
        
        # If no more ads are found, stop the loop
        if not ads:
            print(f"No more data at page {page}, stopping...")
            break
        
        # Extract fields        
        for ad in ads:
            all_data.append({
                "ID": ad.get("ad_id"),
                "List ID": ad.get("list_id"),
                "Posted Time": ad.get("list_time"),
                "Status": ad.get("state"),
                "Transaction Type": ad.get("type"),
                "Seller Name": ad.get("account_name"),
                "Seller ID": ad.get("account_id"),
                "Region (Code)": ad.get("region"),
                "Region Name": ad.get("region_name"),
                "Category": ad.get("category"),
                "Category Name": ad.get("category_name"),
                "Company Ad": ad.get("company_ad"),
                "Title": ad.get("subject"),
                "Description": ad.get("body"),
                "Price": ad.get("price"),
                "Price per m²": ad.get("price_million_per_m2"),
                "City/Province": ad.get("region_name"),
                "Area (m²)": ad.get("size"),
                "Width (m)": ad.get("width"),
                "Length (m)": ad.get("length"),
                "Street Name": ad.get("street_name"),
                "Legal Status": ad.get("property_legal_document"),
                "House Direction": ad.get("direction"),
                "Longitude": ad.get("longitude"),
                "Latitude": ad.get("latitude"),
                "Location": ad.get("location"),
                "Land Type": ad.get("land_type"),
                "Sold Ads Count": ad.get("sold_ads"),
                "Active Ads Count": ad.get("seller_info", {}).get("live_ads"),
                "Seller Avatar": ad.get("avatar"),
                "Post Time": ad.get("date"),
                "First Image Link": ad["images"][0] if ad.get("images") else None,
                "Image List": "; ".join(ad.get("images", [])),
                "Number of Images": ad.get("number_of_images"),
                "Thumbnail Image": ad.get("thumbnail_image"),
                "Map Image": ad.get("pty_map"),
                "Ward Name": ad.get("ward_name"),
                "Property Features": "; ".join(map(str, ad.get("pty_characteristics", []))),
                "Additional Features": "; ".join(map(str, ad.get("ad_features", []))),
                "Land Type (Text)": next((p["value"] for p in ad.get("params", []) if p["id"] == "land_type"), None),
                "Width (Text)": next((p["value"] for p in ad.get("params", []) if p["id"] == "width"), None),
            })

        print(f"Fetched {len(ads)} ads from page {page}, offset {offset}.")

        # Increase for the next loop
        page += 1
        offset += 20  # Offset increases by 20 each time
    else:
        print(f"Error fetching data from page {page}, offset {offset}! Status code: {response.status_code}")
        break  # Stop in case of an error

# Convert data to DataFrame
df = pd.DataFrame(all_data)

Fetched 20 ads from page 1, offset 0.
Fetched 20 ads from page 2, offset 20.
Fetched 20 ads from page 3, offset 40.
Fetched 20 ads from page 4, offset 60.
Fetched 20 ads from page 5, offset 80.
Fetched 20 ads from page 6, offset 100.
Fetched 20 ads from page 7, offset 120.
Fetched 20 ads from page 8, offset 140.
Fetched 20 ads from page 9, offset 160.
Fetched 20 ads from page 10, offset 180.
Fetched 20 ads from page 11, offset 200.
Fetched 20 ads from page 12, offset 220.
Fetched 20 ads from page 13, offset 240.
Fetched 20 ads from page 14, offset 260.
Fetched 20 ads from page 15, offset 280.
Fetched 20 ads from page 16, offset 300.
Fetched 20 ads from page 17, offset 320.
Fetched 20 ads from page 18, offset 340.
Fetched 20 ads from page 19, offset 360.
Fetched 20 ads from page 20, offset 380.
Fetched 20 ads from page 21, offset 400.
Fetched 20 ads from page 22, offset 420.
Fetched 20 ads from page 23, offset 440.
Fetched 20 ads from page 24, offset 460.
Fetched 20 ads from page 25, of
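
The crawl loop above stops only when a page comes back empty or a request fails. A possible variation, sketched here under assumptions, separates pagination from the HTTP call and adds a page cap so the loop cannot run unbounded. `iter_ads`, `fake_fetch`, and `max_pages` are hypothetical names, and the fake fetcher stands in for the real request so the logic can be tried without network access.

```python
from typing import Callable, Iterator


def iter_ads(fetch_page: Callable[[int, int], list],
             page_size: int = 20,
             max_pages: int = 1000) -> Iterator[dict]:
    """Yield ads page by page until a page is empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        offset = (page - 1) * page_size
        ads = fetch_page(page, offset)
        if not ads:
            break  # an empty page means there is nothing more to fetch
        yield from ads


# Stand-in for the real HTTP call (hypothetical data, no network access).
# In the notebook this function would wrap requests.get(url, timeout=...)
# and return response.json().get("ads", []).
FAKE_LISTING = [{"ad_id": i} for i in range(45)]


def fake_fetch(page: int, offset: int) -> list:
    return FAKE_LISTING[offset:offset + 20]


all_ads = list(iter_ads(fake_fetch))
print(f"Collected {len(all_ads)} ads")
```

Passing the fetch function in as an argument also makes it straightforward to add retries or a delay between requests in one place.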

#### Preliminary feature transformation

##### Convert the posted time to DD/MM/YYYY format

In [4]:
# Function to convert timestamp (milliseconds) to day/month/year format
def convert_timestamp(timestamp):
    if pd.isna(timestamp):  # Check for NaN values
        return None
    try:
        return datetime.fromtimestamp(int(timestamp) / 1000).strftime("%d/%m/%Y")
    except (TypeError, ValueError, OverflowError, OSError):
        return None  # Handle invalid timestamp

# Apply conversion to the "Posted Time" column
df["Posted Time"] = df["Posted Time"].apply(convert_timestamp)
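
Note that `datetime.fromtimestamp` without a `tz` argument converts to the local time zone of the machine running the notebook, so the same timestamp can map to different dates on different machines. The sketch below pins the conversion to UTC+7, assuming the listing times should be read in Vietnam time. The function name `convert_timestamp_tz` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Vietnam observes UTC+7 year-round, so a fixed offset is sufficient here.
VIETNAM_TZ = timezone(timedelta(hours=7))


def convert_timestamp_tz(timestamp_ms, tz=VIETNAM_TZ):
    """Convert a millisecond epoch timestamp to DD/MM/YYYY in the given time zone."""
    try:
        seconds = int(timestamp_ms) / 1000
        return datetime.fromtimestamp(seconds, tz=tz).strftime("%d/%m/%Y")
    except (TypeError, ValueError, OverflowError, OSError):
        # None, NaN, non-numeric, or out-of-range input
        return None


# 2025-02-20 18:00 UTC is already 21 February in Vietnam.
print(convert_timestamp_tz(1740074400000))
```

It could be applied the same way as the original function, for example `df["Posted Time"].apply(convert_timestamp_tz)`.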

#### Export to a CSV file

In [5]:
today_str = datetime.today().strftime("%Y-%m-%d")
csv_file = f"RawData_{today_str}.csv"
df.to_csv(csv_file, index=False, encoding="utf-8-sig")

print(f"Data has been processed and saved to '{csv_file}' with {len(df)} ads.")

Data has been processed and saved to 'RawData_2025-02-20.csv' with 3707 ads.


A cumulative `RawData.csv` file is kept between runs. Load it if it exists, append the newly crawled `df`, and drop any row whose posted date and coordinates match a row that is already present.

In [7]:
csv_file = "RawData.csv"

if os.path.exists(csv_file):
    old_df = pd.read_csv(csv_file, encoding="utf-8-sig")
else:
    old_df = pd.DataFrame()  

if not old_df.empty:
    merged_df = pd.concat([old_df, df]) 
    merged_df = merged_df.drop_duplicates(subset=["Posted Time", "Latitude", "Longitude"], keep="first") 
else:
    merged_df = df 

merged_df.to_csv(csv_file, index=False, encoding="utf-8-sig")

print(f"Data has been processed and saved to '{csv_file}' with {len(merged_df)} unique ads.")

Data has been processed and saved to 'RawData.csv' with 4393 unique ads.
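
The keep-first behaviour of the merge step can be seen more directly in plain Python. The sketch below mirrors the logic of concatenating the old and new rows and keeping the first row for each (posted date, latitude, longitude) key. The function name and the sample rows are hypothetical, and it does not reproduce how pandas handles missing values.

```python
KEY_COLUMNS = ("Posted Time", "Latitude", "Longitude")


def merge_keep_first(old_rows, new_rows, key_columns=KEY_COLUMNS):
    """Concatenate old and new rows, keeping the first row seen for each key."""
    seen = set()
    merged = []
    for row in list(old_rows) + list(new_rows):
        key = tuple(row.get(col) for col in key_columns)
        if key in seen:
            continue  # a row with the same date and coordinates already exists
        seen.add(key)
        merged.append(row)
    return merged


# Hypothetical rows for illustration only.
old_rows = [
    {"Posted Time": "19/02/2025", "Latitude": 16.05, "Longitude": 108.20, "Price": 2_000_000_000},
    {"Posted Time": "19/02/2025", "Latitude": 16.07, "Longitude": 108.22, "Price": 3_500_000_000},
]
new_rows = [
    {"Posted Time": "19/02/2025", "Latitude": 16.05, "Longitude": 108.20, "Price": 1_900_000_000},
    {"Posted Time": "20/02/2025", "Latitude": 16.05, "Longitude": 108.20, "Price": 1_900_000_000},
]

merged = merge_keep_first(old_rows, new_rows)
print(len(merged))
```

Because the stored rows come first, a listing that is crawled again with a changed price keeps its old price. Two different listings posted on the same day at the same coordinates would also collapse into one row, so a key such as `List ID` may be worth considering if that matters.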


#### Overview of the dataset:

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3689 entries, 0 to 3688
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   3689 non-null   int64  
 1   List ID              3689 non-null   int64  
 2   Posted Time          3689 non-null   object 
 3   Status               3689 non-null   object 
 4   Transaction Type     3689 non-null   object 
 5   Seller Name          3689 non-null   object 
 6   Seller ID            3689 non-null   int64  
 7   Region (Code)        3689 non-null   int64  
 8   Region Name          3689 non-null   object 
 9   Category             3689 non-null   int64  
 10  Category Name        3689 non-null   object 
 11  Company Ad           3282 non-null   object 
 12  Title                3689 non-null   object 
 13  Description          3689 non-null   object 
 14  Price                3689 non-null   int64  
 15  Price per m²         3689 non-null   f