# World's Real Estate Data Analysis - Data Preprocessing

This notebook preprocesses the world real estate dataset. The steps, in the recommended order, are:
1. Data Loading and Initial Exploration
2. Data Type Conversions
3. Cleaning Unrealistic Values
4. Handling Missing Values
5. Feature Engineering & Categorical Data Processing
6. Final Cleanup

In [2]:
import pandas as pd

df = pd.read_csv("data/world_real_estate_data.csv")

# 1. Initial Exploration and Quality Check

With the dataset loaded, we examine its basic properties:
- Shape (number of rows and columns)
- First few records
- Last few records
- Data information (data types and non-null counts)
- Basic statistics
- Missing value analysis
- Check for duplicates

In [3]:
df.shape

(147536, 14)

In [4]:
df.head()

Unnamed: 0,title,country,location,building_construction_year,building_total_floors,apartment_floor,apartment_rooms,apartment_bedrooms,apartment_bathrooms,apartment_total_area,apartment_living_area,price_in_USD,image,url
0,2 room apartment 120 m² in Mediterranean Regio...,Turkey,"Mediterranean Region, Turkey",,5.0,1.0,3.0,2.0,2.0,120 m²,110 m²,315209.0,https://realting.com/uploads/bigSlider/ab3/888...,https://realting.com/property-for-sale/turkey/...
1,"4 room villa 500 m² in Kalkan, Turkey",Turkey,"Kalkan, Mediterranean Region, Kas, Turkey",2021.0,2.0,,,,,500 m²,480 m²,1108667.0,https://realting.com/uploads/bigSlider/87b/679...,https://realting.com/property-for-sale/turkey/...
2,"1 room apartment 65 m² in Antalya, Turkey",Turkey,"Mediterranean Region, Antalya, Turkey",,5.0,2.0,2.0,1.0,1.0,65 m²,60 m²,173211.0,https://realting.com/uploads/bigSlider/030/a11...,https://realting.com/property-for-sale/turkey/...
3,"1 room apartment in Pattaya, Thailand",Thailand,"Chon Buri Province, Pattaya, Thailand",2020.0,15.0,5.0,2.0,1.0,1.0,,40 m²,99900.0,https://realting.com/uploads/bigSlider/e9a/e06...,https://realting.com/property-for-sale/thailan...
4,"2 room apartment in Pattaya, Thailand",Thailand,"Chon Buri Province, Pattaya, Thailand",2026.0,8.0,3.0,3.0,2.0,1.0,,36 m²,67000.0,https://realting.com/uploads/bigSlider/453/aa2...,https://realting.com/property-for-sale/thailan...


In [5]:
df.tail()

Unnamed: 0,title,country,location,building_construction_year,building_total_floors,apartment_floor,apartment_rooms,apartment_bedrooms,apartment_bathrooms,apartment_total_area,apartment_living_area,price_in_USD,image,url
147531,"5 room apartment 310 m² in Gazipasa, Turkey",Turkey,"Mediterranean Region, Gazipasa, Turkey",,,,,5.0,,310 m²,,597810.0,https://realting.com/uploads/bigSlider/e4a/67f...,https://realting.com/property-for-sale/turkey/...
147532,"4 room apartment 192 m² in Marmara Region, Turkey",Turkey,"Marmara Region, Turkey",2023.0,5.0,,5.0,4.0,2.0,192 m²,151 m²,637195.0,https://realting.com/uploads/bigSlider/93e/5c6...,https://realting.com/property-for-sale/turkey/...
147533,"2 room apartment in Marmara Region, Turkey",Turkey,"Marmara Region, Turkey",,,,3.0,2.0,2.0,,84 m²,477146.0,https://realting.com/uploads/bigSlider/4ae/9d8...,https://realting.com/property-for-sale/turkey/...
147534,"Apartment in Akarca, Turkey",Turkey,"Akarca, Central Anatolia Region, Turkey",2023.0,,,,,,,,819163.0,https://realting.com/uploads/bigSlider/164/7e6...,https://realting.com/property-for-sale/turkey/...
147535,"4 room apartment 140 m² in, Turkey",Turkey,Turkey,,2.0,,5.0,4.0,,140 m²,,939164.0,https://realting.com/uploads/bigSlider/fab/0eb...,https://realting.com/property-for-sale/turkey/...


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147536 entries, 0 to 147535
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   title                       147536 non-null  object 
 1   country                     147406 non-null  object 
 2   location                    147405 non-null  object 
 3   building_construction_year  64719 non-null   float64
 4   building_total_floors       68224 non-null   float64
 5   apartment_floor             54592 non-null   float64
 6   apartment_rooms             74178 non-null   float64
 7   apartment_bedrooms          36982 non-null   float64
 8   apartment_bathrooms         55973 non-null   float64
 9   apartment_total_area        141796 non-null  object 
 10  apartment_living_area       27712 non-null   object 
 11  price_in_USD                144961 non-null  float64
 12  image                       147536 non-null  object 
 13  url           

In [7]:
df.describe()

Unnamed: 0,building_construction_year,building_total_floors,apartment_floor,apartment_rooms,apartment_bedrooms,apartment_bathrooms,price_in_USD
count,64719.0,68224.0,54592.0,74178.0,36982.0,55973.0,144961.0
mean,1996.921754,8.575692,5.791709,2.572097,2.289222,1.364229,412172.2
std,157.527635,8.356781,5.541368,1.319545,18.276913,0.745019,842098.4
min,1.0,-1.0,-2.0,-1.0,-1.0,1.0,0.0
25%,2004.0,2.0,2.0,2.0,1.0,1.0,105420.0
50%,2021.0,5.0,4.0,2.0,2.0,1.0,190212.0
75%,2024.0,14.0,8.0,3.0,3.0,2.0,398930.0
max,2316.0,124.0,202.0,124.0,2009.0,43.0,30602830.0


In [8]:
df.describe(include='all')

Unnamed: 0,title,country,location,building_construction_year,building_total_floors,apartment_floor,apartment_rooms,apartment_bedrooms,apartment_bathrooms,apartment_total_area,apartment_living_area,price_in_USD,image,url
count,147536,147406,147405,64719.0,68224.0,54592.0,74178.0,36982.0,55973.0,141796,27712,144961.0,147536,147536
unique,78292,27,7445,,,,,,,1492,641,,113753,147536
top,"1 room apartment 24 m² in poselenie Sosenskoe,...",Turkey,"Mediterranean Region, Sekerhane Mahallesi, Ala...",,,,,,,100 m²,30 m²,,https://realting.com/uploads/bigSlider/9b9/e2b...,https://realting.com/property-for-sale/turkey/...
freq,486,25724,7244,,,,,,,2468,888,,1219,1
mean,,,,1996.921754,8.575692,5.791709,2.572097,2.289222,1.364229,,,412172.2,,
std,,,,157.527635,8.356781,5.541368,1.319545,18.276913,0.745019,,,842098.4,,
min,,,,1.0,-1.0,-2.0,-1.0,-1.0,1.0,,,0.0,,
25%,,,,2004.0,2.0,2.0,2.0,1.0,1.0,,,105420.0,,
50%,,,,2021.0,5.0,4.0,2.0,2.0,1.0,,,190212.0,,
75%,,,,2024.0,14.0,8.0,3.0,3.0,2.0,,,398930.0,,


In [9]:
df.isnull().sum()

title                              0
country                          130
location                         131
building_construction_year     82817
building_total_floors          79312
apartment_floor                92944
apartment_rooms                73358
apartment_bedrooms            110554
apartment_bathrooms            91563
apartment_total_area            5740
apartment_living_area         119824
price_in_USD                    2575
image                              0
url                                0
dtype: int64

In [10]:
(df.isnull().sum() / df.shape[0] * 100).sort_values(ascending=False)

apartment_living_area         81.216788
apartment_bedrooms            74.933576
apartment_floor               62.997506
apartment_bathrooms           62.061463
building_construction_year    56.133418
building_total_floors         53.757727
apartment_rooms               49.722102
apartment_total_area           3.890576
price_in_USD                   1.745337
location                       0.088792
country                        0.088114
title                          0.000000
image                          0.000000
url                            0.000000
dtype: float64
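The count and percentage views above can be combined into one table. The following is a minimal sketch on a made-up toy frame; `missing_summary` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    missing = frame.isna().sum()
    percent = (frame.isna().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": missing, "percent": percent})
    return summary.sort_values("percent", ascending=False)

# Toy frame standing in for the real dataset (values are invented)
toy = pd.DataFrame({
    "price_in_USD": [100.0, np.nan, 300.0, 400.0],
    "apartment_rooms": [np.nan, np.nan, 2.0, 3.0],
    "title": ["a", "b", "c", "d"],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` should reproduce the numbers from the two cells above in a single frame.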

In [11]:
df.duplicated().sum()

0

In [12]:
duplicates = df[df['url'].duplicated(keep=False)]
print(duplicates)

Empty DataFrame
Columns: [title, country, location, building_construction_year, building_total_floors, apartment_floor, apartment_rooms, apartment_bedrooms, apartment_bathrooms, apartment_total_area, apartment_living_area, price_in_USD, image, url]
Index: []


In [13]:
cleaned_df = df.copy()

cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147536 entries, 0 to 147535
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   title                       147536 non-null  object 
 1   country                     147406 non-null  object 
 2   location                    147405 non-null  object 
 3   building_construction_year  64719 non-null   float64
 4   building_total_floors       68224 non-null   float64
 5   apartment_floor             54592 non-null   float64
 6   apartment_rooms             74178 non-null   float64
 7   apartment_bedrooms          36982 non-null   float64
 8   apartment_bathrooms         55973 non-null   float64
 9   apartment_total_area        141796 non-null  object 
 10  apartment_living_area       27712 non-null   object 
 11  price_in_USD                144961 non-null  float64
 12  image                       147536 non-null  object 
 13  url           

# 2. Data Type Conversion

Clean and standardize area-related columns:
- Convert area values from string to numeric format
- Remove 'm²' suffix
- Handle different decimal separators (comma vs period)
- Remove whitespace
- Convert to numeric values

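Convert other columns:
- Free-text object columns (`title`, `image`, `url`) to the pandas `string` dtype
- Count-like numeric columns (floors, rooms) that hold whole numbers with gaps are candidates for a nullable integer type
- Columns with repeated nominal values (`country`, `location`) to the `category` dtype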
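The area clean-up steps listed above are applied twice in the cells that follow, once per column. One way to avoid the repetition is a small helper; the sketch below assumes the same string operations as the notebook, and `parse_area` is a hypothetical name introduced only for illustration.

```python
import pandas as pd

def parse_area(column: pd.Series) -> pd.Series:
    """Turn strings such as '65,5 m²' into floats; unparseable values become NaN."""
    cleaned = (
        column.astype(str)                      # missing values become text that fails to parse
        .str.replace("m²", "", regex=False)     # drop the unit suffix
        .str.replace(",", ".", regex=False)     # treat a comma as the decimal separator
        .str.replace(r"\s+", "", regex=True)    # remove all whitespace
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Invented sample values, including a missing entry and a non-numeric one
toy = pd.Series(["120 m²", "65,5 m²", None, "1 200 m²", "n/a"])
areas = parse_area(toy)
print(areas)
```

Because every comma is treated as a decimal separator, a value written with a thousands comma such as `1,200 m²` would be read as 1.2. It is worth checking the raw values for that pattern before relying on the result.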

In [14]:
cleaned_df["apartment_total_area"] = (
    cleaned_df["apartment_total_area"]
    .astype(str)
    .str.replace("m²", "", regex=False)
    .str.replace(",", ".", regex=False)
    .str.replace(r"\s+", "", regex=True)
    .str.strip()
)

cleaned_df["apartment_total_area"] = pd.to_numeric(cleaned_df["apartment_total_area"], errors="coerce")

In [15]:
cleaned_df["apartment_living_area"] = (
    cleaned_df["apartment_living_area"]
    .astype(str)
    .str.replace("m²", "", regex=False)
    .str.replace(",", ".", regex=False)
    .str.replace(r"\s+", "", regex=True)
    .str.strip()
)

cleaned_df["apartment_living_area"] = pd.to_numeric(cleaned_df["apartment_living_area"], errors="coerce")

cleaned_df.head()

Unnamed: 0,title,country,location,building_construction_year,building_total_floors,apartment_floor,apartment_rooms,apartment_bedrooms,apartment_bathrooms,apartment_total_area,apartment_living_area,price_in_USD,image,url
0,2 room apartment 120 m² in Mediterranean Regio...,Turkey,"Mediterranean Region, Turkey",,5.0,1.0,3.0,2.0,2.0,120.0,110.0,315209.0,https://realting.com/uploads/bigSlider/ab3/888...,https://realting.com/property-for-sale/turkey/...
1,"4 room villa 500 m² in Kalkan, Turkey",Turkey,"Kalkan, Mediterranean Region, Kas, Turkey",2021.0,2.0,,,,,500.0,480.0,1108667.0,https://realting.com/uploads/bigSlider/87b/679...,https://realting.com/property-for-sale/turkey/...
2,"1 room apartment 65 m² in Antalya, Turkey",Turkey,"Mediterranean Region, Antalya, Turkey",,5.0,2.0,2.0,1.0,1.0,65.0,60.0,173211.0,https://realting.com/uploads/bigSlider/030/a11...,https://realting.com/property-for-sale/turkey/...
3,"1 room apartment in Pattaya, Thailand",Thailand,"Chon Buri Province, Pattaya, Thailand",2020.0,15.0,5.0,2.0,1.0,1.0,,40.0,99900.0,https://realting.com/uploads/bigSlider/e9a/e06...,https://realting.com/property-for-sale/thailan...
4,"2 room apartment in Pattaya, Thailand",Thailand,"Chon Buri Province, Pattaya, Thailand",2026.0,8.0,3.0,3.0,2.0,1.0,,36.0,67000.0,https://realting.com/uploads/bigSlider/453/aa2...,https://realting.com/property-for-sale/thailan...


In [16]:
text_cols = ["title", "image", "url"]

for col in text_cols:
    cleaned_df[col] = cleaned_df[col].astype("string").str.strip()

In [17]:
cleaned_df['country'] = cleaned_df['country'].astype('category')
cleaned_df['location'] = cleaned_df['location'].astype('category')
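The conversion list mentions a nullable integer type, but the count columns remain `float64` in the `info()` output. A minimal sketch of that conversion, on invented data, could look like this. It assumes every non-missing value is a whole number, which is not true after later steps that compute fractional medians.

```python
import numpy as np
import pandas as pd

# Toy frame with float-coded counts and gaps (values are invented)
toy = pd.DataFrame({
    "building_total_floors": [5.0, np.nan, 15.0],
    "apartment_rooms": [3.0, 2.0, np.nan],
})

count_cols = ["building_total_floors", "apartment_rooms"]

# "Int64" (capital I) is the pandas nullable integer dtype; missing values become <NA>
converted = toy.astype({col: "Int64" for col in count_cols})
print(converted.dtypes)
```

Casting fails if a column contains a fractional value such as 3.5, so this step fits best after all cleaning and imputation, with rounding applied first where needed.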

In [18]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147536 entries, 0 to 147535
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype   
---  ------                      --------------   -----   
 0   title                       147536 non-null  string  
 1   country                     147406 non-null  category
 2   location                    147405 non-null  category
 3   building_construction_year  64719 non-null   float64 
 4   building_total_floors       68224 non-null   float64 
 5   apartment_floor             54592 non-null   float64 
 6   apartment_rooms             74178 non-null   float64 
 7   apartment_bedrooms          36982 non-null   float64 
 8   apartment_bathrooms         55973 non-null   float64 
 9   apartment_total_area        141796 non-null  float64 
 10  apartment_living_area       27712 non-null   float64 
 11  price_in_USD                144961 non-null  float64 
 12  image                       147536 non-null  string  
 13 

# 3. Cleaning Unrealistic Values

In this section, we'll clean unrealistic values in our dataset:
- Clip building construction years to the range 1800-2025
- Set reasonable limits for rooms and floors
- Ensure logical relationships between:
  - Apartment floor and building total floors
  - Number of bedrooms and total rooms
  - Number of bathrooms and total rooms
  - Total rooms and specialized rooms (bedrooms + bathrooms)
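Clipping keeps every row but turns impossible years, such as the minimum of 1 and the maximum of 2316 seen in `describe()`, into the boundary values 1800 and 2025. An alternative is to treat out-of-range years as missing and let the imputation step handle them. The sketch below contrasts the two on invented values; it is not the approach the notebook takes.

```python
import numpy as np
import pandas as pd

years = pd.Series([1.0, 1995.0, 2316.0, np.nan, 2021.0])

# Approach used in the notebook: force values into the range
clipped = years.clip(1800, 2025)

# Alternative: keep in-range values and turn everything else into NaN
masked = years.where(years.between(1800, 2025))

print(pd.DataFrame({"raw": years, "clipped": clipped, "masked": masked}))
```

Masking avoids an artificial pile-up of listings at exactly 1800 or 2025, at the cost of more missing values to fill later.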

In [19]:
cleaned_df['building_construction_year'] = cleaned_df['building_construction_year'].clip(1800, 2025)

max_limits = {
    'building_total_floors': 150, 
    'apartment_floor': 150,
    'apartment_rooms': 20,
    'apartment_bedrooms': 15,
    'apartment_bathrooms': 10
}

# Clip negative values to 0 and cap each column at its maximum limit
for col, max_val in max_limits.items():
    cleaned_df[col] = cleaned_df[col].clip(lower=0, upper=max_val)

# Make sure apartment floor is never higher than building total floors
cleaned_df['apartment_floor'] = cleaned_df.apply(
    lambda row: min(row['apartment_floor'], row['building_total_floors'])
    if pd.notnull(row['apartment_floor']) and pd.notnull(row['building_total_floors'])
    else row['apartment_floor'],
    axis=1
)

# Make sure bedrooms <= rooms
cleaned_df['apartment_bedrooms'] = cleaned_df.apply(
    lambda row: min(row['apartment_bedrooms'], row['apartment_rooms'])
    if pd.notnull(row['apartment_bedrooms']) and pd.notnull(row['apartment_rooms'])
    else row['apartment_bedrooms'],
    axis=1
)

# Make sure bathrooms <= rooms
cleaned_df['apartment_bathrooms'] = cleaned_df.apply(
    lambda row: min(row['apartment_bathrooms'], row['apartment_rooms'])
    if pd.notnull(row['apartment_bathrooms']) and pd.notnull(row['apartment_rooms'])
    else row['apartment_bathrooms'],
    axis=1
)

# Ensure total rooms is enough to accommodate bedrooms, bathrooms, and at least one common area
def adjust_room_counts(row):
    if pd.isnull(row['apartment_rooms']) or pd.isnull(row['apartment_bedrooms']) or pd.isnull(row['apartment_bathrooms']):
        return row
    
    total_specialized_rooms = row['apartment_bedrooms'] + row['apartment_bathrooms']
    
    # If sum of specialized rooms exceeds or equals total rooms, adjust total rooms
    # Add 1 for minimum common area (living room/kitchen)
    if total_specialized_rooms >= row['apartment_rooms']:
        row['apartment_rooms'] = total_specialized_rooms + 1
    
    return row

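# Note: a row-wise apply rebuilds the frame from object rows, so the string and
# category dtypes set in section 2 revert to object (visible in later info() output).
cleaned_df = cleaned_df.apply(adjust_room_counts, axis=1)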

print("Summary statistics after cleaning unrealistic values:")
print(cleaned_df[list(max_limits.keys()) + ['building_construction_year']].describe())

print("\nRoom relationships summary:")
print(cleaned_df[['apartment_rooms', 'apartment_bedrooms', 'apartment_bathrooms']].describe())


Summary statistics after cleaning unrealistic values:
       building_total_floors  apartment_floor  apartment_rooms  \
count           68224.000000     54592.000000     74178.000000   
mean                8.575706         5.726022         2.954313   
std                 8.356765         5.456325         1.499847   
min                 0.000000         0.000000         0.000000   
25%                 2.000000         2.000000         2.000000   
50%                 5.000000         4.000000         3.000000   
75%                14.000000         8.000000         4.000000   
max               124.000000        90.000000        25.000000   

       apartment_bedrooms  apartment_bathrooms  building_construction_year  
count        36982.000000         55973.000000                64719.000000  
mean             2.085880             1.348454                 2007.778118  
std              1.058409             0.673297                   30.297252  
min              0.000000             0.000
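The row-wise `apply` calls above can also be written as column operations, which are usually faster and leave the other columns' dtypes untouched. The sketch below applies the same rules to an invented toy frame; `cap_at` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": pd.array(["a", "b", "c"], dtype="string"),
    "apartment_floor": [7.0, 2.0, np.nan],
    "building_total_floors": [5.0, 9.0, 3.0],
    "apartment_rooms": [3.0, 2.0, np.nan],
    "apartment_bedrooms": [2.0, 1.0, 4.0],
    "apartment_bathrooms": [2.0, np.nan, 1.0],
})

def cap_at(values: pd.Series, limit: pd.Series) -> pd.Series:
    """Replace values above the limit; comparisons with NaN are False, so gaps are kept."""
    return values.mask(values > limit, limit)

toy["apartment_floor"] = cap_at(toy["apartment_floor"], toy["building_total_floors"])
toy["apartment_bedrooms"] = cap_at(toy["apartment_bedrooms"], toy["apartment_rooms"])
toy["apartment_bathrooms"] = cap_at(toy["apartment_bathrooms"], toy["apartment_rooms"])

# Raise the room count where bedrooms + bathrooms leave no space for a common area
specialized = toy["apartment_bedrooms"] + toy["apartment_bathrooms"]
too_few = specialized >= toy["apartment_rooms"]  # False whenever any operand is NaN
toy["apartment_rooms"] = toy["apartment_rooms"].mask(too_few, specialized + 1)

print(toy)
print(toy.dtypes)
```

As in the notebook, the adjusted room count can exceed the earlier cap of 20, because the last rule runs after clipping.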

# 4. Handle Missing Values



In [20]:
cleaned_df["apartment_total_area"] = cleaned_df["apartment_total_area"].fillna(
    cleaned_df["apartment_total_area"].median()
)

cleaned_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147536 entries, 0 to 147535
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   title                       147536 non-null  object 
 1   country                     147406 non-null  object 
 2   location                    147405 non-null  object 
 3   building_construction_year  64719 non-null   float64
 4   building_total_floors       68224 non-null   float64
 5   apartment_floor             54592 non-null   float64
 6   apartment_rooms             74178 non-null   float64
 7   apartment_bedrooms          36982 non-null   float64
 8   apartment_bathrooms         55973 non-null   float64
 9   apartment_total_area        147536 non-null  float64
 10  apartment_living_area       27712 non-null   float64
 11  price_in_USD                144961 non-null  float64
 12  image                       147536 non-null  object 
 13  url           

In [21]:
import re

def extract_room_info(title):
    bedrooms = None
    rooms = None

    bed_match = re.search(r'(\d+)\s*(bedroom|bedrooms|br)', title, re.IGNORECASE)
    if bed_match:
        bedrooms = int(bed_match.group(1))

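    # The digits must be followed directly by "room", so "3 bedrooms" cannot match here;
    # the (?<!bed) lookbehind is kept as written but never changes the outcome.
    room_match = re.search(r'(\d+)\s*(?<!bed)room[s]?', title, re.IGNORECASE)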
    if room_match:
        rooms = int(room_match.group(1))

    return bedrooms, rooms


for i, title in enumerate(cleaned_df["title"]):
    bed, room = extract_room_info(title)

    if pd.isna(cleaned_df.at[i, "apartment_bedrooms"]) and bed is not None:
        cleaned_df.at[i, "apartment_bedrooms"] = bed

    if pd.isna(cleaned_df.at[i, "apartment_rooms"]) and room is not None:
        cleaned_df.at[i, "apartment_rooms"] = room

cleaned_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147536 entries, 0 to 147535
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   title                       147536 non-null  object 
 1   country                     147406 non-null  object 
 2   location                    147405 non-null  object 
 3   building_construction_year  64719 non-null   float64
 4   building_total_floors       68224 non-null   float64
 5   apartment_floor             54592 non-null   float64
 6   apartment_rooms             120715 non-null  float64
 7   apartment_bedrooms          40880 non-null   float64
 8   apartment_bathrooms         55973 non-null   float64
 9   apartment_total_area        147536 non-null  float64
 10  apartment_living_area       27712 non-null   float64
 11  price_in_USD                144961 non-null  float64
 12  image                       147536 non-null  object 
 13  url           
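The title parsing above loops over every row with `.at`. A vectorized sketch of the same idea with `Series.str.extract` is shown below on invented rows. The patterns are slightly stricter than the notebook's (a trailing `\b` is added), so counts on the real data may differ; `extract_count` is a hypothetical helper name.

```python
import re

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": [
        "2 room apartment 120 m² in Mediterranean Region, Turkey",
        "Penthouse 3 bedrooms 150 m² in Alanya, Turkey",
        "Apartment in Akarca, Turkey",
    ],
    "apartment_rooms": [np.nan, np.nan, 4.0],
    "apartment_bedrooms": [np.nan, np.nan, np.nan],
})

def extract_count(titles: pd.Series, pattern: str) -> pd.Series:
    """Pull the first captured number out of each title; NaN where nothing matches."""
    found = titles.str.extract(pattern, flags=re.IGNORECASE, expand=False)
    return pd.to_numeric(found, errors="coerce")

rooms_from_title = extract_count(toy["title"], r"(\d+)\s*rooms?\b")
bedrooms_from_title = extract_count(toy["title"], r"(\d+)\s*(?:bedrooms?|br)\b")

# fillna only touches rows that are still missing, so existing values win
toy["apartment_rooms"] = toy["apartment_rooms"].fillna(rooms_from_title)
toy["apartment_bedrooms"] = toy["apartment_bedrooms"].fillna(bedrooms_from_title)

print(toy[["apartment_rooms", "apartment_bedrooms"]])
```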

In [22]:
cleaned_df["apartment_rooms"] = cleaned_df["apartment_rooms"].fillna(
    cleaned_df["apartment_rooms"].median()
)

cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147536 entries, 0 to 147535
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   title                       147536 non-null  object 
 1   country                     147406 non-null  object 
 2   location                    147405 non-null  object 
 3   building_construction_year  64719 non-null   float64
 4   building_total_floors       68224 non-null   float64
 5   apartment_floor             54592 non-null   float64
 6   apartment_rooms             147536 non-null  float64
 7   apartment_bedrooms          40880 non-null   float64
 8   apartment_bathrooms         55973 non-null   float64
 9   apartment_total_area        147536 non-null  float64
 10  apartment_living_area       27712 non-null   float64
 11  price_in_USD                144961 non-null  float64
 12  image                       147536 non-null  object 
 13  url           

In [23]:
cleaned_df['country'] = cleaned_df['country'].fillna(cleaned_df['country'].mode()[0])
cleaned_df['location'] = (
    cleaned_df.groupby('country', observed=False)['location']
    .transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else 'Unknown'))
)

cleaned_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147536 entries, 0 to 147535
Data columns (total 14 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   title                       147536 non-null  object 
 1   country                     147536 non-null  object 
 2   location                    147536 non-null  object 
 3   building_construction_year  64719 non-null   float64
 4   building_total_floors       68224 non-null   float64
 5   apartment_floor             54592 non-null   float64
 6   apartment_rooms             147536 non-null  float64
 7   apartment_bedrooms          40880 non-null   float64
 8   apartment_bathrooms         55973 non-null   float64
 9   apartment_total_area        147536 non-null  float64
 10  apartment_living_area       27712 non-null   float64
 11  price_in_USD                144961 non-null  float64
 12  image                       147536 non-null  object 
 13  url           

In [24]:
cleaned_df['location'].eq('Unknown').sum()

0

# 5. Property Type Feature Engineering

Extract property types from listing titles using regex patterns:
- Identify common property types (apartment, house, villa, etc.)
- Create a new 'property_type' column
- Handle edge cases and refine categorization
- Analyze distribution of property types
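When writing the regex patterns for this step, alternation precedence is easy to get wrong: `|` binds more loosely than `\b`, so an ungrouped pattern matches more than intended. The sketch below demonstrates this with made-up titles and a small table-driven classifier; the `PATTERNS` table and `classify` are hypothetical and cover fewer types than the notebook.

```python
import re

loose = re.compile(r"\bplot|land\b")        # parsed as (\bplot) OR (land\b)
strict = re.compile(r"\b(?:plot|land)\b")   # whole word "plot" or whole word "land"

# Ordered list: the first matching pattern decides the label
PATTERNS = [
    ("apartment", re.compile(r"\b(?:apartment|flat)\b")),
    ("house", re.compile(r"\bhouse\b")),
    ("villa", re.compile(r"\bvilla\b")),
    ("land", strict),
]

def classify(title: str) -> str:
    """Return the first label whose pattern matches the lower-cased title."""
    lowered = title.lower()
    for label, pattern in PATTERNS:
        if pattern.search(lowered):
            return label
    return "other"

sample = "cottage in pattaya, thailand"  # invented title
print(bool(loose.search(sample)), bool(strict.search(sample)))
print(classify("Plot of land in Bar, Montenegro"))
```

The loose pattern flags the sample because "thailand" ends in "land", while the grouped pattern requires a word boundary on both sides.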

In [25]:
cleaned_df[['apartment_rooms', 'apartment_bedrooms', 'apartment_bathrooms', 'apartment_total_area', 'price_in_USD']].corr()


Unnamed: 0,apartment_rooms,apartment_bedrooms,apartment_bathrooms,apartment_total_area,price_in_USD
apartment_rooms,1.0,0.461224,0.55305,0.001077,0.301287
apartment_bedrooms,0.461224,1.0,0.692628,0.016358,0.344934
apartment_bathrooms,0.55305,0.692628,1.0,0.016039,0.441658
apartment_total_area,0.001077,0.016358,0.016039,1.0,0.002207
price_in_USD,0.301287,0.344934,0.441658,0.002207,1.0


In [26]:
import re

def extract_property_type(title):
    title_lower = title.lower()
    if re.search(r'\b(apartment|flat)\b', title_lower):
        return 'apartment'
    elif re.search(r'\bstudio\b', title_lower):
        return 'studio'
    elif re.search(r'\bhouse\b', title_lower):
        return 'house'
    elif re.search(r'\bvilla\b', title_lower):
        return 'villa'
    elif re.search(r'\boffice\b', title_lower):
        return 'office'
    elif re.search(r'\bcommercial\b', title_lower):
        return 'commercial'
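    # Caution: this parses as (\bplot)|(land\b), so it also matches titles ending in
    # "...land" such as "Thailand"; r'\b(plot|land)\b' would match whole words only.
    elif re.search(r'\bplot|land\b', title_lower):
        return 'land'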
    else:
        return 'other'

cleaned_df['property_type'] = cleaned_df['title'].apply(extract_property_type)

cleaned_df['property_type'].value_counts()


property_type
apartment    93618
house        32344
villa        10236
other        10173
land          1165
Name: count, dtype: int64

In [27]:
cols = cleaned_df.columns.tolist()

cols.remove('property_type')

title_index = cols.index('title') + 1
cols.insert(title_index, 'property_type')

cleaned_df = cleaned_df[cols]

cleaned_df.head()

Unnamed: 0,title,property_type,country,location,building_construction_year,building_total_floors,apartment_floor,apartment_rooms,apartment_bedrooms,apartment_bathrooms,apartment_total_area,apartment_living_area,price_in_USD,image,url
0,2 room apartment 120 m² in Mediterranean Regio...,apartment,Turkey,"Mediterranean Region, Turkey",,5.0,1.0,5.0,2.0,2.0,120.0,110.0,315209.0,https://realting.com/uploads/bigSlider/ab3/888...,https://realting.com/property-for-sale/turkey/...
1,"4 room villa 500 m² in Kalkan, Turkey",villa,Turkey,"Kalkan, Mediterranean Region, Kas, Turkey",2021.0,2.0,,4.0,,,500.0,480.0,1108667.0,https://realting.com/uploads/bigSlider/87b/679...,https://realting.com/property-for-sale/turkey/...
2,"1 room apartment 65 m² in Antalya, Turkey",apartment,Turkey,"Mediterranean Region, Antalya, Turkey",,5.0,2.0,3.0,1.0,1.0,65.0,60.0,173211.0,https://realting.com/uploads/bigSlider/030/a11...,https://realting.com/property-for-sale/turkey/...
3,"1 room apartment in Pattaya, Thailand",apartment,Thailand,"Chon Buri Province, Pattaya, Thailand",2020.0,15.0,5.0,3.0,1.0,1.0,89.0,40.0,99900.0,https://realting.com/uploads/bigSlider/e9a/e06...,https://realting.com/property-for-sale/thailan...
4,"2 room apartment in Pattaya, Thailand",apartment,Thailand,"Chon Buri Province, Pattaya, Thailand",2025.0,8.0,3.0,4.0,2.0,1.0,89.0,36.0,67000.0,https://realting.com/uploads/bigSlider/453/aa2...,https://realting.com/property-for-sale/thailan...


In [28]:
other_listings = cleaned_df[cleaned_df['property_type'] == 'other']
other_listings[['title', 'country', 'location', 'apartment_rooms', 'apartment_floor', 'apartment_bedrooms', 'apartment_bathrooms']]


Unnamed: 0,title,country,location,apartment_rooms,apartment_floor,apartment_bedrooms,apartment_bathrooms
10,"Penthouse 1 bedroom 96 m² in Kyrenia, Northern...",Northern Cyprus,"Kyrenia, Girne (Kyrenia) District, Northern Cy...",4.0,,1.0,2.0
13,"Cottage 555 m² in Haranski sielski Saviet, Bel...",Belarus,"Minsk Region, Haranski sielski Saviet, Minsk D...",3.0,,,
17,"Penthouse 3 bedrooms 150 m² in Alanya, Turkey",Turkey,"Mediterranean Region, Sekerhane Mahallesi, Ala...",6.0,,3.0,2.0
96,"Multilevel apartments 3 bedrooms 95 m² in Bar,...",Montenegro,"Bar, Bar Municipality, Montenegro",6.0,5.0,3.0,2.0
111,"Penthouse 2 bedrooms 135 m² in Mahmutlar, Turkey",Turkey,"Mahmutlar, Mediterranean Region, Alanya, Turkey",5.0,,2.0,2.0
...,...,...,...,...,...,...,...
147375,"Multilevel apartments 1 bedroom 50 m² in Bali,...",Indonesia,"Bali, Indonesia",3.0,,1.0,1.0
147425,"Multilevel apartments 1 bedroom 45 m² in Bali,...",Indonesia,"Bali, Indonesia",3.0,,1.0,1.0
147507,"Cottage 2 bathrooms 200 m² in Kissomlyo, Hungary",Hungary,"Kissomlyo, Transdanubia, Celldoemoelki jaras, ...",3.0,,,
147516,"Penthouse 5 bedrooms 240 m² in Okurcalar, Turkey",Turkey,"Okurcalar, Mediterranean Region, Alanya, Turkey",9.0,,5.0,3.0


In [29]:
def refine_property_type(row):
    title = str(row['title']).lower()
    current_type = row['property_type']
    
    if current_type != 'other':
        return current_type
    
    if 'condo' in title or 'duplex' in title or 'penthouse' in title or 'apartment' in title or 'flat' in title:
        return 'apartment'
    elif 'mansion' in title or 'bungalow' in title or 'cottage' in title or 'house' in title or 'villa' in title:
        return 'house' if 'villa' not in title else 'villa'
    elif 'studio' in title:
        return 'studio'
    # Caution: a substring test also matches country names such as "Thailand" or "Poland"
    elif 'land' in title or 'plot' in title:
        return 'land'
    else:
        return 'other'

cleaned_df['property_type'] = cleaned_df.apply(refine_property_type, axis=1)

cleaned_df['property_type'].value_counts()

property_type
apartment    97820
house        38114
villa        10258
land          1166
other          178
Name: count, dtype: int64

In [30]:
other_listings = cleaned_df[cleaned_df['property_type'] == 'other']
other_listings[['title', 'country', 'location', 'apartment_rooms', 'apartment_floor', 'apartment_bedrooms', 'apartment_bathrooms']]

Unnamed: 0,title,country,location,apartment_rooms,apartment_floor,apartment_bedrooms,apartment_bathrooms
4101,"Room 4 bedrooms 277 m² in Turkey, Turkey",Turkey,Turkey,9.0,,4.0,4.0
5861,"Room 3 bedrooms 235 m² in Turkey, Turkey",Turkey,Turkey,7.0,,3.0,3.0
6840,"Room 3 bedrooms in Marmara Region, Turkey",Turkey,"Marmara Region, Turkey",7.0,,3.0,3.0
7047,"Room 1 bedroom 91 m² in Turkey, Turkey",Turkey,Turkey,3.0,,1.0,1.0
15294,"Room 2 bedrooms 170 m² in Turkey, Turkey",Turkey,Turkey,5.0,,2.0,2.0
...,...,...,...,...,...,...,...
140970,"54 m² in Suhobezvodnoe, Russia",Russia,"Suhobezvodnoe, Volga Federal District, Semenov...",3.0,,,
140971,"76 m² in Sergach, Russia",Russia,"Sergach, Volga Federal District, Sergachsky Di...",3.0,,,
141088,"169 m² in Nizhny Novgorod, Russia",Russia,"Volga Federal District, Nizhny Novgorod, Russia",3.0,,,
144199,"Room 4 rooms 97 m² in Tashkent, Uzbekistan",Uzbekistan,"Tashkent, Uzbekistan",4.0,2.0,,


# 6. Handling Remaining Missing Values


In [31]:
cleaned_df.isnull().sum()

title                              0
property_type                      0
country                            0
location                           0
building_construction_year     82817
building_total_floors          79312
apartment_floor                92944
apartment_rooms                    0
apartment_bedrooms            106656
apartment_bathrooms            91563
apartment_total_area               0
apartment_living_area         119824
price_in_USD                    2575
image                              0
url                                0
dtype: int64

In [32]:
cleaned_df.drop(columns=['image', 'url', 'apartment_living_area'], inplace=True, errors='ignore')

In [33]:
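# Fill missing years with the median year of the listing's country
cleaned_df['building_construction_year'] = (
    cleaned_df.groupby('country', observed=False)['building_construction_year']
    .transform(lambda x: x.fillna(x.median()))
)

# Any country without a recorded year still has NaN here, so fall back to the overall median
cleaned_df['building_construction_year'] = cleaned_df['building_construction_year'].fillna(
    cleaned_df['building_construction_year'].median()
)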





In [34]:
cleaned_df.isnull().sum()

title                              0
property_type                      0
country                            0
location                           0
building_construction_year         0
building_total_floors          79312
apartment_floor                92944
apartment_rooms                    0
apartment_bedrooms            106656
apartment_bathrooms            91563
apartment_total_area               0
price_in_USD                    2575
dtype: int64
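The two-stage fill used for `building_construction_year` (group median first, overall median as a fallback) can be traced on a small example. The sketch below uses invented data in which one country has no recorded year at all.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "building_construction_year": [2000.0, np.nan, 2010.0, np.nan, np.nan],
})

# Stage 1: per-country median, broadcast back to every row of that country
group_median = toy.groupby("country")["building_construction_year"].transform("median")
filled = toy["building_construction_year"].fillna(group_median)

# Stage 2: country B has no data, so its rows are still NaN; use the overall median
filled = filled.fillna(filled.median())

print(filled.tolist())
```

As in the notebook, the fallback median is computed after the group fill, so it is weighted toward countries with many imputed rows.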

In [35]:
import numpy as np

cleaned_df['building_total_floors'] = (
    cleaned_df.groupby('property_type')['building_total_floors']
    .transform(lambda x: x.fillna(x.median()))
)

cleaned_df['apartment_floor'] = (
    cleaned_df.groupby('property_type')['apartment_floor']
    .transform(lambda x: x.fillna(x.median()))
)

cleaned_df['apartment_floor'] = np.minimum(cleaned_df['apartment_floor'], cleaned_df['building_total_floors'])


print(cleaned_df[['property_type', 'building_total_floors', 'apartment_floor']].info())
print(cleaned_df[['property_type', 'building_total_floors', 'apartment_floor']].head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147536 entries, 0 to 147535
Data columns (total 3 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   property_type          147536 non-null  object 
 1   building_total_floors  147536 non-null  float64
 2   apartment_floor        147536 non-null  float64
dtypes: float64(2), object(1)
memory usage: 3.4+ MB
None
  property_type  building_total_floors  apartment_floor
0     apartment                    5.0              1.0
1         villa                    2.0              1.0
2     apartment                    5.0              2.0
3     apartment                   15.0              5.0
4     apartment                    8.0              3.0
5     apartment                    9.0              4.0
6     apartment                    2.0              2.0
7     apartment                    8.0              2.0
8     apartment                    9.0              4.0
9     ap
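
The imputation in the cell above can be traced by hand on a small example. The sketch below assumes a made-up frame (`toy` and all of its values are hypothetical, not taken from the dataset) and applies the same two steps: a per-`property_type` median fill, then a cap of `apartment_floor` at `building_total_floors`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real estate dataset
toy = pd.DataFrame({
    "property_type": ["apartment", "apartment", "apartment", "villa", "villa"],
    "building_total_floors": [5.0, np.nan, 9.0, 2.0, np.nan],
    "apartment_floor": [1.0, 8.0, np.nan, np.nan, 1.0],
})

# Step 1: fill each missing value with the median of its property type
for col in ["building_total_floors", "apartment_floor"]:
    toy[col] = toy.groupby("property_type")[col].transform(
        lambda s: s.fillna(s.median())
    )

# Step 2: cap the unit's floor at the building's total floors
toy["apartment_floor"] = np.minimum(
    toy["apartment_floor"], toy["building_total_floors"]
)

print(toy)
```

In the second row the recorded floor (8) is higher than the imputed building height (7), so the cap lowers it to 7. This is the kind of inconsistency the `np.minimum` step guards against after both columns are filled independently.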

# 7. 

In [36]:
cleaned_df.groupby('property_type')[['apartment_rooms',
'apartment_bedrooms', 'apartment_bathrooms']].median()

property_type,apartment_rooms,apartment_bedrooms,apartment_bathrooms
apartment,3.0,2.0,1.0
house,3.0,3.5,
land,3.0,2.0,1.5
other,3.0,3.0,2.0
villa,4.0,7.0,2.0


In [37]:
# Land → 0 bedrooms, 0 bathrooms
cleaned_df.loc[cleaned_df['property_type'] == 'land', ['apartment_bedrooms', 'apartment_bathrooms']] = 0

# Apartments → fill missing with median per property_type
apartment_mask = cleaned_df['property_type'] == 'apartment'
cleaned_df.loc[apartment_mask, 'apartment_bedrooms'] = cleaned_df.loc[apartment_mask, 'apartment_bedrooms']\
    .fillna(cleaned_df.loc[apartment_mask, 'apartment_bedrooms'].median())
cleaned_df.loc[apartment_mask, 'apartment_bathrooms'] = cleaned_df.loc[apartment_mask, 'apartment_bathrooms']\
    .fillna(cleaned_df.loc[apartment_mask, 'apartment_bathrooms'].median())

# Villas → median per property_type
villa_mask = cleaned_df['property_type'] == 'villa'
cleaned_df.loc[villa_mask, 'apartment_bedrooms'] = cleaned_df.loc[villa_mask, 'apartment_bedrooms']\
    .fillna(cleaned_df.loc[villa_mask, 'apartment_bedrooms'].median())
cleaned_df.loc[villa_mask, 'apartment_bathrooms'] = cleaned_df.loc[villa_mask, 'apartment_bathrooms']\
    .fillna(cleaned_df.loc[villa_mask, 'apartment_bathrooms'].median())

# Houses → use apartment_rooms to estimate bedrooms, bathrooms
house_mask = cleaned_df['property_type'] == 'house'
# Fill bedrooms as 60% of rooms
cleaned_df.loc[house_mask & cleaned_df['apartment_bedrooms'].isna(), 'apartment_bedrooms'] = \
    (cleaned_df.loc[house_mask & cleaned_df['apartment_bedrooms'].isna(), 'apartment_rooms'] * 0.6).round()

# Fill bathrooms as 1 per 2 bedrooms, minimum 1
cleaned_df.loc[house_mask & cleaned_df['apartment_bathrooms'].isna(), 'apartment_bathrooms'] = \
    (cleaned_df.loc[house_mask & cleaned_df['apartment_bathrooms'].isna(), 'apartment_bedrooms'] / 2).clip(lower=1).round()

# Other → fill with median of other
other_mask = cleaned_df['property_type'] == 'other'
cleaned_df.loc[other_mask, 'apartment_bedrooms'] = cleaned_df.loc[other_mask, 'apartment_bedrooms']\
    .fillna(cleaned_df.loc[other_mask, 'apartment_bedrooms'].median())
cleaned_df.loc[other_mask, 'apartment_bathrooms'] = cleaned_df.loc[other_mask, 'apartment_bathrooms']\
    .fillna(cleaned_df.loc[other_mask, 'apartment_bathrooms'].median())

# Final check → fill any remaining NaNs with global median
# (assign the result back; chained inplace fillna stops working in pandas 3.0)
cleaned_df['apartment_bedrooms'] = cleaned_df['apartment_bedrooms'].fillna(cleaned_df['apartment_bedrooms'].median())
cleaned_df['apartment_bathrooms'] = cleaned_df['apartment_bathrooms'].fillna(cleaned_df['apartment_bathrooms'].median())

# Cap extreme values per property type
# (types missing from the dicts fall back to 5 bedrooms / 3 bathrooms)
max_bedrooms = {'studio':1, 'apartment':5, 'house':10, 'villa':15, 'land':0, 'other':5}
max_bathrooms = {'studio':1, 'apartment':3, 'house':5, 'villa':7, 'land':0, 'other':3}

# Vectorized cap: map each row's property type to its limit, then clip
cleaned_df['apartment_bedrooms'] = cleaned_df['apartment_bedrooms'].clip(
    upper=cleaned_df['property_type'].map(max_bedrooms).fillna(5))
cleaned_df['apartment_bathrooms'] = cleaned_df['apartment_bathrooms'].clip(
    upper=cleaned_df['property_type'].map(max_bathrooms).fillna(3))

cleaned_df.isnull().sum()

title                            0
property_type                    0
country                          0
location                         0
building_construction_year       0
building_total_floors            0
apartment_floor                  0
apartment_rooms                  0
apartment_bedrooms               0
apartment_bathrooms              0
apartment_total_area             0
price_in_USD                  2575
dtype: int64
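
The house rule in the cell above (bedrooms as 60% of rooms, bathrooms as one per two bedrooms with a minimum of 1) is easier to judge on concrete numbers. The following is a minimal sketch on a made-up frame: `houses` and its values are hypothetical, and only the column names match the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical house listings; values are made up for illustration
houses = pd.DataFrame({
    "apartment_rooms": [2.0, 3.0, 5.0, 8.0],
    "apartment_bedrooms": [np.nan, np.nan, np.nan, 6.0],
    "apartment_bathrooms": [np.nan, np.nan, 1.0, np.nan],
})

# Missing bedrooms: 60% of the room count, rounded
missing_bed = houses["apartment_bedrooms"].isna()
houses.loc[missing_bed, "apartment_bedrooms"] = (
    houses.loc[missing_bed, "apartment_rooms"] * 0.6
).round()

# Missing bathrooms: one per two bedrooms, at least 1, rounded
missing_bath = houses["apartment_bathrooms"].isna()
houses.loc[missing_bath, "apartment_bathrooms"] = (
    houses.loc[missing_bath, "apartment_bedrooms"] / 2
).clip(lower=1).round()

print(houses)
```

Values that were already present (6 bedrooms in the last row, 1 bathroom in the third) are left untouched; only the gaps are estimated. Note that pandas rounds exact halves to the nearest even number, so a house with 5 bedrooms would be assigned 2 bathrooms rather than 3.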

In [38]:
cleaned_df.dropna(subset=['price_in_USD'], inplace=True)