Rental data cleaning and exploration for the listings scraped in Tartu and Tallinn
In this notebook we review the data scraped from kv.ee and discard anything unimportant, unhelpful or unnecessary.

In [1]:
# Importing required libraries

import pandas as pd
import json
import numpy as np
import os
from pathlib import Path

# Display settings for better data viewing

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)


Listing all the files in our specific data folder

In [2]:
data_folder = Path('../../Data_11.04.25')

# List all JSON files in the data folder
json_files = list(data_folder.glob('*.json'))
print(f"Found {len(json_files)} JSON files:")
for f in json_files:
    print(f"  - {f.name}")

Found 4 JSON files:
  - scraped_listings_tln.json
  - scraped_listings_trt.json
  - scrape_tln.json
  - scrape_trt.json


Loading the JSON files for the listings and converting them to a DataFrame

In [3]:

json_files = [
    data_folder / 'scraped_listings_tln.json',
    data_folder / 'scraped_listings_trt.json'
]

print(f"Loading {len(json_files)} specific JSON files:")
for f in json_files:
    exists = "✓" if f.exists() else "✗ NOT FOUND"
    print(f"  {exists} {f.name}")

# Load all given JSON files
all_data = []
for file_path in json_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            all_data.extend(data)
            print(f"✓ Loaded {len(data)} records from {file_path.name}")
    except Exception as e:
        print(f"✗ Error loading {file_path.name}: {e}")

print(f"\nTotal records loaded: {len(all_data)}")

# Convert to DataFrame
df = pd.DataFrame(all_data)
print(f"DataFrame shape: {df.shape}")
print(f"Columns ({len(df.columns)}): {df.columns.tolist()}")

Loading 2 specific JSON files:
  ✓ scraped_listings_tln.json
  ✓ scraped_listings_trt.json
✓ Loaded 1359 records from scraped_listings_tln.json
✓ Loaded 450 records from scraped_listings_trt.json

Total records loaded: 1809
DataFrame shape: (1809, 20)
Columns (20): ['id', 'url', 'price', 'latitude', 'longitude', 'Üürida korter', 'Tube', 'Üldpind', 'Korrus/Korruseid', 'Ehitusaasta', 'Seisukord', 'Korruseid', 'Magamistube', 'Energiamärgis', 'Omandivorm', 'Ettemaks', 'Kulud suvel/talvel', 'Katastrinumber', 'Üürida korter (Broneeritud)', 'Registriosa number']


In [4]:
# checking the data types
df.head()
df.dtypes

id                             object
url                            object
price                          object
latitude                       object
longitude                      object
Üürida korter                  object
Tube                           object
Üldpind                        object
Korrus/Korruseid               object
Ehitusaasta                    object
Seisukord                      object
Korruseid                      object
Magamistube                    object
Energiamärgis                  object
Omandivorm                     object
Ettemaks                       object
Kulud suvel/talvel             object
Katastrinumber                 object
Üürida korter (Broneeritud)    object
Registriosa number             object
dtype: object
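
Every column is `object` because the scraped values arrive as text. A minimal sketch of how such a column can be turned numeric, using made-up toy values rather than the real listings:

```python
import pandas as pd

# Toy values (made up for illustration): numbers stored as text,
# plus a "?" placeholder and a real missing value
raw = pd.Series(["3", "1", "?", None], dtype="object")

# errors="coerce" turns anything unparseable into NaN instead of raising
as_number = pd.to_numeric(raw, errors="coerce")

print(raw.dtype)               # object
print(as_number.dtype)         # float64
print(as_number.isna().sum())  # 2
```

The float64 result is expected: NaN has no integer representation in a plain NumPy integer column.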

Missing values analysis

In [5]:
missing_stats = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum().values,
    'Missing_Pct': (df.isnull().sum().values / len(df) * 100).round(2),
    'Non_Null': df.notnull().sum().values
}).sort_values('Missing_Pct', ascending=False)

missing_stats

Unnamed: 0,Column,Missing_Count,Missing_Pct,Non_Null
18,Üürida korter (Broneeritud),1717,94.91,92
11,Korruseid,1707,94.36,102
19,Registriosa number,1662,91.87,147
15,Ettemaks,1406,77.72,403
16,Kulud suvel/talvel,1178,65.12,631
17,Katastrinumber,640,35.38,1169
14,Omandivorm,550,30.4,1259
12,Magamistube,513,28.36,1296
13,Energiamärgis,503,27.81,1306
9,Ehitusaasta,332,18.35,1477


Taking a look at columns with high missing rates (over 20%)

In [6]:
high_missing = missing_stats[missing_stats['Missing_Pct'] > 20]
print(f"Columns with >20% missing: {len(high_missing)}")
high_missing


Columns with >20% missing: 9


Unnamed: 0,Column,Missing_Count,Missing_Pct,Non_Null
18,Üürida korter (Broneeritud),1717,94.91,92
11,Korruseid,1707,94.36,102
19,Registriosa number,1662,91.87,147
15,Ettemaks,1406,77.72,403
16,Kulud suvel/talvel,1178,65.12,631
17,Katastrinumber,640,35.38,1169
14,Omandivorm,550,30.4,1259
12,Magamistube,513,28.36,1296
13,Energiamärgis,503,27.81,1306


Taking sample values from key columns and checking for duplicate IDs, to better understand the format of the scraped data.

In [7]:
key_cols = ['price', 'Tube', 'Üldpind', 'Korrus/Korruseid', 'Ehitusaasta', 'Seisukord']
for col in key_cols:
    if col in df.columns:
        print(f"\n{col}: {df[col].nunique()} unique | Sample:")
        print(df[col].dropna().head(5).tolist())

# Check for duplicate IDs
dups = df.duplicated(subset=['id']).sum()
print(f"Duplicate IDs: {dups}")
if dups > 0:
    # A bare expression inside an if-block is not displayed, so print it explicitly
    print(df[df.duplicated(subset=['id'], keep=False)].sort_values('id'))



price: 1571 unique | Sample:
['595\xa0€  8.75 €/m²', '365\xa0€  9.13 €/m²', '1\xa0100\xa0€  9.96 €/m²', '220\xa0€  14.7 €/m²', '850\xa0€  11.3 €/m²']

Tube: 6 unique | Sample:
['3', '1', '3', '1', '3']

Üldpind: 695 unique | Sample:
['68\xa0m²', '40\xa0m²', '110.4\xa0m²', '15\xa0m²', '75\xa0m²']

Korrus/Korruseid: 108 unique | Sample:
['6/9', '3/3', '2/3', '2/4', '2/5']

Ehitusaasta: 127 unique | Sample:
['1990', '1911', '1933', '2015', '2007']

Seisukord: 8 unique | Sample:
['San. remont tehtud', 'Renoveeritud', 'San. remont tehtud', 'Renoveeritud', 'Heas korras']
Duplicate IDs: 0


Making a clean copy of the dataset before cleaning

In [8]:
df_original = df.copy(deep=True)
print("✓ Original data backed up")

✓ Original data backed up


Data cleaning - counting the missing values (NaN and "?" placeholders) per column

In [9]:
total_rows = len(df)
for column in df.columns:
    # Count "?" and NaN as missing
    total_missing = (df[column].astype(str).str.strip() == "?").sum() + df[column].isna().sum()
    if total_missing > 0:
        missing_pct = (total_missing/total_rows)*100
        print(f"{column}: {missing_pct.round(2)}%")
    else:
        print(f"{column}: no missing values")


id: no missing values
url: no missing values
price: 0.17%
latitude: 0.11%
longitude: 0.11%
Üürida korter: 5.2%
Tube: 0.55%
Üldpind: 0.17%
Korrus/Korruseid: 7.3%
Ehitusaasta: 18.35%
Seisukord: 3.48%
Korruseid: 94.36%
Magamistube: 28.36%
Energiamärgis: 27.81%
Omandivorm: 30.4%
Ettemaks: 77.72%
Kulud suvel/talvel: 65.12%
Katastrinumber: 35.38%
Üürida korter (Broneeritud): 94.91%
Registriosa number: 91.87%
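
The same counting idea on a toy column, as a rough sketch. The values are invented; the point is only that `isna()` alone undercounts when "?" is used as a placeholder.

```python
import pandas as pd

# Toy column (invented values) mixing real data, "?" placeholders and a true null
toy = pd.Series(["3/9", " ? ", None, "?", "2/5"], dtype="object")

# A cell counts as missing if it is null OR is a "?" once whitespace is stripped
is_placeholder = toy.astype(str).str.strip().eq("?")
is_missing = toy.isna() | is_placeholder

print(toy.isna().sum())   # 1  (only the real null)
print(is_missing.sum())   # 3  (null + two placeholders)
```

Because `astype(str)` turns a null into a text value that is not "?", the two conditions never overlap and no cell is counted twice.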


Stripping whitespace from string columns

In [11]:
for column in df.columns:
    if df[column].dtype == 'object':
        stripped_series = df[column].str.strip()
        if not df[column].equals(stripped_series):
            df[column] = stripped_series
            print(f"✓ Stripped whitespace from {column}")
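
One caveat worth keeping in mind: on an `object` column, `.str.strip()` only handles string elements and returns NaN for anything else. The outputs later in the notebook suggest every scraped value here is a string, so this is probably harmless for this dataset. The sketch below uses invented values to show the effect and a guarded alternative.

```python
import pandas as pd

# Invented mixed column: one padded string, one real float, one null
mixed = pd.Series([" 59.43 ", 26.72, None], dtype="object")

# .str.strip() returns NaN for the non-string element 26.72
stripped = mixed.str.strip()

# Guarded version: strip strings, leave every other value untouched
guarded = mixed.map(lambda v: v.strip() if isinstance(v, str) else v)

print(stripped.tolist())
print(guarded.tolist())
```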

Replacing "?" with np.nan and non-breaking spaces (\xa0) with regular spaces

In [12]:
print("Cleaning text data...")
for column in df.columns:
    if df[column].dtype == 'object':
        # Replace various forms of "?"
        df[column] = df[column].replace('?', np.nan)
        df[column] = df[column].replace(' ?', np.nan)
        df[column] = df[column].replace('? ', np.nan)
        # Replace non-breaking spaces (\xa0) with regular spaces
        df[column] = df[column].str.replace('\xa0', ' ', regex=False)

print("✓ Replaced '?' with NaN and cleaned non-breaking spaces")


Cleaning text data...
✓ Replaced '?' with NaN and cleaned non-breaking spaces


Data cleaning - extracting the numeric monthly price from the format '450 € 6.79 €/m²'

In [13]:
def extract_price(price_str):
    """Extract monthly rent from '450 € 6.79 €/m²' format"""
    if pd.isna(price_str):
        return np.nan
    try:
        # Remove all spaces and take first number before €
        clean_str = str(price_str).replace(' ', '').split('€')[0]
        return float(clean_str)
    except ValueError:
        return np.nan

df['price_clean'] = df['price'].apply(extract_price)
print(f"Price extraction: {df['price_clean'].notna().sum()} / {len(df)} successful")
print(f"Price range: {df['price_clean'].min():.0f} - {df['price_clean'].max():.0f} €")
print(f"Data type: {df['price_clean'].dtype}")


Price extraction: 1805 / 1809 successful
Price range: 1 - 7500 €
Data type: float64
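
A regex-based variant of the same idea, as a sketch only. `parse_monthly_rent` is a hypothetical helper, not part of the notebook, and accepting a decimal comma is an assumption about how prices might be written.

```python
import re
from typing import Optional

# The monthly rent is the number in front of the first "€".
# \s also matches the non-breaking space used as thousands separator.
_PRICE_RE = re.compile(r"^\s*(\d[\d\s]*(?:[.,]\d+)?)\s*€")


def parse_monthly_rent(text: Optional[str]) -> Optional[float]:
    """Return the rent as a float, or None when no price can be read."""
    if not isinstance(text, str):
        return None
    match = _PRICE_RE.match(text)
    if match is None:
        return None
    # Remove thousands separators, normalise a decimal comma to a dot
    digits = re.sub(r"\s", "", match.group(1)).replace(",", ".")
    return float(digits)


print(parse_monthly_rent("1\xa0100\xa0€  9.96 €/m²"))  # 1100.0
print(parse_monthly_rent("595 €  8.75 €/m²"))          # 595.0
print(parse_monthly_rent("?"))                         # None
```

Compared with splitting on "€", the pattern rejects strings that do not start with a number, so no exception handling is needed.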


Extracting the area in square meters from the '68 m²' format

In [14]:
def extract_sqm(area_str):
    """Extract numeric area from '66.3 m²' format"""
    if pd.isna(area_str):
        return np.nan
    try:
        # Remove spaces and 'm²', then convert to float
        clean_str = str(area_str).replace(' ', '').replace('m²', '')
        return float(clean_str)
    except ValueError:
        return np.nan

df['area_sqm'] = df['Üldpind'].apply(extract_sqm)
print(f"Area extraction: {df['area_sqm'].notna().sum()} / {len(df)} successful")
print(f"Area range: {df['area_sqm'].min():.1f} - {df['area_sqm'].max():.1f} m²")

Area extraction: 1806 / 1809 successful
Area range: 7.0 - 264.0 m²
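
Since the raw price string also carries a €/m² figure, the extracted rent and area can be cross-checked against it. A small sketch using the five sample rows printed earlier; `per_sqm_matches` and the 0.05 tolerance are assumptions chosen for illustration (the listed figure appears to be rounded to three significant digits).

```python
import math


def per_sqm_matches(rent, area, listed_per_sqm, tolerance=0.05):
    """Return True when rent / area agrees with the listed €/m² within tolerance."""
    if area <= 0:
        return False
    return math.isclose(rent / area, listed_per_sqm, abs_tol=tolerance)


# (rent in €, area in m², listed €/m²) from the sample output above
samples = [
    (595, 68, 8.75),
    (365, 40, 9.13),
    (1100, 110.4, 9.96),
    (220, 15, 14.7),
    (850, 75, 11.3),
]

for rent, area, listed in samples:
    print(rent, area, listed, per_sqm_matches(rent, area, listed))
```

A row that fails this check would point to a parsing problem in either the price or the area.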


Splitting the floor information into two columns: the floor the apartment is on and the total number of floors in the building

In [15]:
def extract_floor(floor_str):
    """Extract floor and total floors from '3/9' format"""
    if pd.isna(floor_str):
        return np.nan, np.nan
    try:
        parts = floor_str.split('/')
        floor = int(parts[0].strip())
        total = int(parts[1].strip())
        return floor, total
    except (AttributeError, IndexError, ValueError):
        return np.nan, np.nan

df[['floor', 'total_floors']] = df['Korrus/Korruseid'].apply(
    lambda x: pd.Series(extract_floor(x))
)
print(f"Floor extraction: {df['floor'].notna().sum()} / {len(df)} successful")

Floor extraction: 1677 / 1809 successful
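
The same split can also be done without a row-wise `apply`, using `Series.str.extract`. This is an alternative sketch on invented values ("Kelder" and the padded entry are hypothetical inputs), not the notebook's method.

```python
import pandas as pd

# Invented examples, including a basement floor and an unparseable value
floors_raw = pd.Series(
    ["6/9", "3/3", " 2 / 4 ", "-1/5", "Kelder", None], dtype="object"
)

# Two capture groups give a two-column frame; non-matching rows become NaN
parts = floors_raw.str.extract(r"^\s*(-?\d+)\s*/\s*(\d+)\s*$")
floor = pd.to_numeric(parts[0], errors="coerce")
total_floors = pd.to_numeric(parts[1], errors="coerce")

print(pd.DataFrame({"floor": floor, "total_floors": total_floors}))
```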


Converting room numbers to numeric

In [16]:
df['rooms'] = pd.to_numeric(df['Tube'], errors='coerce')
print(f"Rooms: {df['rooms'].notna().sum()} / {len(df)} valid")
print(f"Unique room values: {sorted(df['rooms'].dropna().unique())}")

Rooms: 1799 / 1809 valid
Unique room values: [np.float64(1.0), np.float64(2.0), np.float64(3.0), np.float64(4.0), np.float64(5.0), np.float64(6.0)]


Clean construction year (change to numeric and check for outliers)

In [17]:
df['build_year'] = pd.to_numeric(df['Ehitusaasta'], errors='coerce')

# Validate build year (e.g. there is an outlier with build year 1377)
# We assume the valid range is 1800-2025
df.loc[df['build_year'] < 1800, 'build_year'] = np.nan
df.loc[df['build_year'] > 2025, 'build_year'] = np.nan

print(f"Build year: {df['build_year'].notna().sum()} / {len(df)} valid")
if df['build_year'].notna().sum() > 0:
    print(f"Year range: {df['build_year'].min():.0f} - {df['build_year'].max():.0f}")
print(f"Removed {(pd.to_numeric(df['Ehitusaasta'], errors='coerce').notna() & df['build_year'].isna()).sum()} suspicious years")
print(f"Data type: {df['build_year'].dtype}")

Build year: 1469 / 1809 valid
Year range: 1807 - 2025
Removed 8 suspicious years
Data type: float64
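
The range check can be written in one step with `between` and `where`. A minimal sketch on invented years, using the same assumed 1800-2025 window:

```python
import pandas as pd

# Invented values: two plausible years, two out-of-range years, one non-number
years = pd.to_numeric(
    pd.Series(["1990", "1377", "2015", "20255", "abc"]), errors="coerce"
)

# between() is inclusive on both ends; where() blanks everything outside it
valid = years.where(years.between(1800, 2025))

# Values that were numeric but fell outside the window
removed = int((years.notna() & valid.isna()).sum())

print(valid.tolist())
print(removed)  # 2
```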


Extracting deposit amounts

In [18]:
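def extract_deposit(deposit_str):
    """Extract deposit from '900 €' format"""
    if pd.isna(deposit_str):
        return np.nan
    try:
        # Drop the currency sign and all spaces so '1 100 €' parses as 1100.0
        return float(deposit_str.replace('€', '').replace(' ', ''))
    except (AttributeError, ValueError):
        return np.nan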

df['deposit'] = df['Ettemaks'].apply(extract_deposit)

Validating coordinates

In [19]:
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')

# Estonia spans roughly lat 57.5-59.7, lon 21.5-28.2; the bounds below add a small margin
valid_coords = (
    (df['latitude'] >= 57) & (df['latitude'] <= 60) &
    (df['longitude'] >= 21) & (df['longitude'] <= 29)
)
print(f"Valid coordinates: {valid_coords.sum()} / {len(df)}")

Valid coordinates: 1807 / 1809
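
The mask above is only reported, not applied. If out-of-range coordinates should be blanked rather than kept, one possible approach is sketched below on toy rows. The first two rows are approximate city-centre coordinates for Tallinn and Tartu; blanking instead of dropping rows is an assumption.

```python
import pandas as pd

# Toy rows: Tallinn, Tartu, a (0, 0) scraping artefact and a missing pair
coords = pd.DataFrame({
    "latitude": ["59.437", "58.378", "0", None],
    "longitude": ["24.754", "26.729", "0", None],
})
coords = coords.apply(pd.to_numeric, errors="coerce")

# Same bounding box as in the cell above; NaN never falls inside the box
in_box = coords["latitude"].between(57, 60) & coords["longitude"].between(21, 29)

# Blank both coordinates when either is outside the box
coords.loc[~in_box, ["latitude", "longitude"]] = float("nan")

print(in_box.tolist())  # [True, True, False, False]
print(coords)
```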


Selecting the columns to keep (every column with more than 20% missing values is removed)

In [20]:
missing_threshold = 20  # Remove columns with >20% missing data

print("Evaluating ALL columns based on missing data threshold (>20%):")
print("="*60)

# Check every column currently in df (raw and derived)
columns_to_keep = []
removed_columns = []

for col in df.columns:
    missing_pct = (df[col].isna().sum() / len(df)) * 100
    
    if missing_pct <= missing_threshold:
        columns_to_keep.append(col)
        print(f"✓ {col:35s} ({missing_pct:6.2f}% missing)")
    else:
        removed_columns.append(col)
        print(f"✗ {col:35s} ({missing_pct:6.2f}% missing) - REMOVED")

print("\n" + "="*60)
print(f"Kept {len(columns_to_keep)} columns, removed {len(removed_columns)} columns")
print(f"\nRemoved columns: {removed_columns}")
print("="*60)


Evaluating ALL columns based on missing data threshold (>20%):
✓ id                                  (  0.00% missing)
✓ url                                 (  0.00% missing)
✓ price                               (  0.17% missing)
✓ latitude                            (  0.11% missing)
✓ longitude                           (  0.11% missing)
✓ Üürida korter                       (  5.20% missing)
✓ Tube                                (  0.55% missing)
✓ Üldpind                             (  0.17% missing)
✓ Korrus/Korruseid                    (  7.30% missing)
✓ Ehitusaasta                         ( 18.35% missing)
✓ Seisukord                           (  3.48% missing)
✗ Korruseid                           ( 94.36% missing) - REMOVED
✗ Magamistube                         ( 28.36% missing) - REMOVED
✗ Energiamärgis                       ( 27.81% missing) - REMOVED
✗ Omandivorm                          ( 30.40% missing) - REMOVED
✗ Ettemaks                            ( 77.72% missing) -
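
The threshold rule itself can be expressed without a loop: `isna().mean()` gives the missing share per column, which can then be compared with the cut-off. A compact sketch on an invented toy frame:

```python
import pandas as pd

# Toy frame (invented values) with one sparse and two well-filled columns
toy = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5", "6"],
    "price": ["595 €", "365 €", None, "220 €", "850 €", "700 €"],  # ~16.7% missing
    "Ettemaks": [None, None, None, "900 €", None, None],            # ~83.3% missing
})

missing_threshold = 0.20  # keep columns with at most 20% missing

missing_share = toy.isna().mean()
keep = missing_share[missing_share <= missing_threshold].index.tolist()
toy_filtered = toy[keep].copy()

print(keep)  # ['id', 'price']
```

The explicit loop in the notebook has the advantage of printing a per-column report; the vectorized form is mainly useful when only the resulting column list is needed.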

Creating the cleaned DataFrame

In [21]:
df_filtered = df[columns_to_keep].copy()

print(f"\nFiltered dataset shape: {df_filtered.shape}")
print(f"Columns remaining: {df_filtered.columns.tolist()}")

# Map original columns to clean English names for the columns we kept
column_mapping = {
    'id': 'id',
    'url': 'url',
    'price': 'price_raw',
    'latitude': 'latitude',
    'longitude': 'longitude',
    'Üürida korter': 'rental_ad',
    'Tube': 'rooms_raw',
    'Üldpind': 'area_raw',
    'Korrus/Korruseid': 'floor_raw',
    'Ehitusaasta': 'build_year_raw',
    'Seisukord': 'condition',
    'Korruseid': 'total_floors_alt',
    'Magamistube': 'bedrooms',
    'Energiamärgis': 'energy_label',
    'Omandivorm': 'ownership_type',
    # Cleaned columns
    'price_clean': 'price',
    'area_sqm': 'area_sqm',
    'rooms': 'rooms',
    'floor': 'floor',
    'total_floors': 'total_floors',
    'build_year': 'build_year'
}

# Selecting only the cleaned numeric columns we created + essential info
final_columns = []
final_names = []

for col in df_filtered.columns:
    if col in ['id', 'url', 'latitude', 'longitude', 'Seisukord', 
               'price_clean', 'area_sqm', 'rooms', 'floor', 'total_floors', 'build_year']:
        final_columns.append(col)
        final_names.append(column_mapping.get(col, col))

df_clean = df_filtered[final_columns].copy()
df_clean.columns = final_names

print(f"\nFinal cleaned dataset with {len(df_clean.columns)} columns:")
print(f"Columns: {df_clean.columns.tolist()}")



Filtered dataset shape: (1809, 17)
Columns remaining: ['id', 'url', 'price', 'latitude', 'longitude', 'Üürida korter', 'Tube', 'Üldpind', 'Korrus/Korruseid', 'Ehitusaasta', 'Seisukord', 'price_clean', 'area_sqm', 'floor', 'total_floors', 'rooms', 'build_year']

Final cleaned dataset with 11 columns:
Columns: ['id', 'url', 'latitude', 'longitude', 'condition', 'price', 'area_sqm', 'floor', 'total_floors', 'rooms', 'build_year']
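
Selecting and renaming can also be done in a single chained step with `DataFrame.rename`. A sketch on a one-row toy frame (the row values are illustrative); only the column names are taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["1"],
    "Seisukord": ["Renoveeritud"],
    "price_clean": [595.0],
    "Tube": ["3"],
})

keep = ["id", "Seisukord", "price_clean"]
rename_map = {"Seisukord": "condition", "price_clean": "price"}

# Columns absent from rename_map keep their names
toy_clean = toy[keep].rename(columns=rename_map)

print(toy_clean.columns.tolist())  # ['id', 'condition', 'price']
```

Unlike assigning to `.columns`, `rename` matches by name, so it does not depend on the order of two parallel lists.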


Checking cleaned data quality

In [22]:
print("Cleaned dataset summary:")
print(f"Total rows: {len(df_clean)}")
print(f"\nMissing values:")
print(df_clean.isnull().sum())
print(f"\nKey statistics:")
print(df_clean[['price', 'area_sqm', 'rooms', 'floor', 'build_year']].describe())

# Checking for unusual values
print("\n" + "="*60)
print("DATA QUALITY CHECKS:")
print("="*60)

# Checking floor values (keeping basement floors as they are valid)
print(f"\nFloor values < 0 (basements): {(df_clean['floor'] < 0).sum()}")
print("  → Keeping basement floors as valid data")

# Price range info
print(f"\nPrice distribution:")
print(f"  < 50€: {(df_clean['price'] < 50).sum()} listings → Will be removed (likely errors)")
print(f"  50-100€: {((df_clean['price'] >= 50) & (df_clean['price'] < 100)).sum()} listings")
print(f"  100-500€: {((df_clean['price'] >= 100) & (df_clean['price'] < 500)).sum()} listings")
print(f"  500-1000€: {((df_clean['price'] >= 500) & (df_clean['price'] < 1000)).sum()} listings")
print(f"  >= 1000€: {(df_clean['price'] >= 1000).sum()} listings")

# Checking extreme areas
print(f"\nArea distribution:")
print(f"  < 15 m²: {(df_clean['area_sqm'] < 15).sum()} listings")
print(f"  15-50 m²: {((df_clean['area_sqm'] >= 15) & (df_clean['area_sqm'] < 50)).sum()} listings")
print(f"  50-100 m²: {((df_clean['area_sqm'] >= 50) & (df_clean['area_sqm'] < 100)).sum()} listings")
print(f"  >= 100 m²: {(df_clean['area_sqm'] >= 100).sum()} listings")

Cleaned dataset summary:
Total rows: 1809

Missing values:
id                0
url               0
latitude          2
longitude         2
condition        63
price             4
area_sqm          3
floor           132
total_floors    132
rooms            10
build_year      340
dtype: int64

Key statistics:
             price     area_sqm        rooms        floor   build_year
count  1805.000000  1806.000000  1799.000000  1677.000000  1469.000000
mean    712.840443    51.629623     2.121178     3.231962  1986.581348
std     506.231734    28.643769     0.921135     2.374131    34.777646
min       1.000000     7.000000     1.000000    -1.000000  1807.000000
25%     444.000000    33.000000     1.000000     2.000000  1963.000000
50%     590.000000    46.900000     2.000000     3.000000  1995.000000
75%     800.000000    63.775000     3.000000     4.000000  2018.000000
max    7500.000000   264.000000     6.000000    30.000000  2025.000000

DATA QUALITY CHECKS:

Floor values < 0 (basements):

Removing incomplete rows and outliers - listings where price, area_sqm or rooms is missing, or where the price is implausibly low (under 50 €)

In [23]:
critical_fields = ['price', 'area_sqm', 'rooms']
df_final = df_clean.dropna(subset=critical_fields)
print(f"Rows after removing records missing critical fields: {len(df_final)} (removed {len(df_clean) - len(df_final)})")

# Remove unrealistic prices (< 50€) - likely data errors or non-rental listings
before_price_filter = len(df_final)
df_final = df_final[df_final['price'] >= 50]
print(f"Rows after removing prices < 50€: {len(df_final)} (removed {before_price_filter - len(df_final)})")

print(f"\nFinal dataset size: {len(df_final)} rows")


Rows after removing records missing critical fields: 1797 (removed 12)
Rows after removing prices < 50€: 1793 (removed 4)

Final dataset size: 1793 rows
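
For comparison only: a common alternative to a fixed cut-off is the 1.5 × IQR fence. The sketch below plugs in the price quartiles from the summary statistics above (25% = 444, 75% = 800); `iqr_fences` is a hypothetical helper and this rule is not what the notebook applies.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = iqr_fences(444.0, 800.0)
print(low, high)  # -90.0 1334.0
```

The lower fence is negative, so this rule would not catch the 1 € listings; that is one argument for the fixed 50 € floor used here. The upper fence would additionally flag listings above 1334 €, which the notebook keeps.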


In [24]:
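output_path = Path('../../Cleaned_csvs/listings_cleaned.csv')
# Create the output folder if it does not exist yet; to_csv does not create it
output_path.parent.mkdir(parents=True, exist_ok=True)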
df_final.to_csv(output_path, index=False, encoding='utf-8')
print(f"✓ Cleaned data saved to: {output_path}")
print(f"Final dataset: {df_final.shape}")

✓ Cleaned data saved to: ..\..\Cleaned_csvs\listings_cleaned.csv
Final dataset: (1793, 11)
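
When the CSV is read back later, `id` will be parsed as an integer unless its type is given explicitly. A round-trip sketch on a toy frame, written to a temporary folder so it does not touch the real output file:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (invented values) with a text id and a missing floor
toy = pd.DataFrame({
    "id": ["101", "102"],
    "price": [595.0, 365.0],
    "floor": [6.0, float("nan")],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "listings_cleaned.csv"
    toy.to_csv(path, index=False, encoding="utf-8")
    # dtype={"id": str} keeps ids as text, matching the cleaned DataFrame
    reloaded = pd.read_csv(path, dtype={"id": str})

print(reloaded.shape)  # (2, 3)
print(reloaded["id"].tolist())
```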
