# Nairobi House Price Prediction – Day 2  
Data Cleaning & Feature Engineering + Basic EDA

Goal: Make the raw data model-ready  
- Remove duplicates
- Handle missing values
- Standardize location names
- Convert size units
- Remove extreme outliers

New features:
- price_per_sqft
- amenity_score (limited: the 'amenities' column is empty after the rename, so use a basic count or skip it)
- month (from the listing date): the date column is empty, so use a constant placeholder month (e.g. 2 for February)
- Optional: distance_to_cbd_km (added via a dict mapping from location to distance)

Output: clean_listings.csv + EDA visuals
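
The size conversion and `price_per_sqft` steps listed above could be sketched as follows. This is a minimal sketch, not the notebook's implementation: it assumes sizes arrive as strings such as `485 m²` (the format visible in the raw data) or with a hypothetical `sqft` suffix. The helper `size_to_sqft`, the `Size_sqft` column, and all sample rows except the `485 m²` value are made up for illustration.

```python
import re

import numpy as np
import pandas as pd

SQFT_PER_SQM = 10.7639  # 1 square metre is approximately 10.7639 square feet


def size_to_sqft(raw):
    """Parse a size string like '485 m²' into square feet; return NaN if unparseable."""
    if pd.isna(raw):
        return np.nan
    text = str(raw).lower().replace(",", "")
    match = re.search(r"(\d+(?:\.\d+)?)", text)
    if match is None:
        return np.nan
    value = float(match.group(1))
    # Assumption: 'm²', 'm2' or 'sqm' means square metres; anything else is treated as square feet.
    if "m²" in text or "m2" in text or "sqm" in text:
        return value * SQFT_PER_SQM
    return value


# Illustrative rows only (hypothetical values, not taken from the dataset).
sample = pd.DataFrame({
    "Size": ["485 m²", None, "2,500 sqft"],
    "Price": [85_000_000.0, 60_000_000.0, 50_000_000.0],
})
sample["Size_sqft"] = sample["Size"].apply(size_to_sqft)
sample["price_per_sqft"] = sample["Price"] / sample["Size_sqft"]
print(sample)
```

Rows with a missing size keep `NaN` in both derived columns, so they can be imputed or excluded in a later step.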

In [1]:
import pandas as pd
import numpy as np
import re # for size parsing
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Data Loading
df = pd.read_csv('../data/raw/clean_listings_buyrentkenya_2026-02-18.csv')
print("Loaded shape:",df.shape)
print("\nColumns:", df.columns.tolist())
print("\nFirst 10 rows")
print(df.head(10))
print("\nSum of Missing values:")
print(df.isna().sum())
print("\nData types:")
print(df.dtypes)

Loaded shape: (407, 7)

Columns: ['Location', 'Property Type', 'Bedrooms', 'Bathrooms', 'Size', 'Price', 'Source_URL']

First 10 rows
             Location Property Type  Bedrooms  Bathrooms    Size        Price  \
0  Thigiri, Westlands         house         6        6.0     NaN  260000000.0   
1         Kiambu Road         house         4        NaN     NaN   78000000.0   
2           Lavington     townhouse         6        7.0     NaN  160000000.0   
3           Lavington         villa         5        5.0     NaN   60000000.0   
4           Lavington         villa         5        6.0     NaN   60000000.0   
5           Lavington         villa         5        5.0     NaN   85000000.0   
6           Lavington         villa         5        6.0     NaN   85000000.0   
7           Lavington         villa         5        9.0  485 m²   85000000.0   
8           Lavington         villa         5        6.0     NaN   95000000.0   
9           Lavington         villa         5        5.0

In [4]:
# Remove rows that are exact duplicates across all columns
df = df.drop_duplicates()

print("Shape of the Data Frame After removing duplicate rows:")
print("\nShape:",df.shape)

Shape of the Data Frame After removing duplicate rows:

Shape: (399, 7)
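
The shape change from 407 to 399 rows means 8 exact duplicates were removed. To inspect how many rows `drop_duplicates()` will remove before applying it, the duplicates can be counted first. A minimal sketch on made-up rows (the values are illustrative, not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Lavington", "Lavington", "Karen"],
    "Bedrooms": [5, 5, 4],
    "Price": [85_000_000.0, 85_000_000.0, 60_000_000.0],
})

# duplicated() flags each repeat of an earlier identical row; the first occurrence is not flagged.
n_dupes = int(toy.duplicated().sum())
print("Exact duplicate rows:", n_dupes)

# Drop the repeats and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print("Shape after dedup:", deduped.shape)
```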


In [7]:
# Handle missing values
# Bedrooms & Bathrooms: fill missing values with the column median
df['Bedrooms'] = df['Bedrooms'].fillna(df['Bedrooms'].median())
df['Bathrooms'] = df['Bathrooms'].fillna(df['Bathrooms'].median())


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Bedrooms'] = df['Bedrooms'].fillna(df['Bedrooms'].median())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Bathrooms'] = df['Bathrooms'].fillna(df['Bathrooms'].median())
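
The message above is pandas' `SettingWithCopyWarning`. It typically appears when a DataFrame that was derived from another one by filtering or slicing is later modified column by column; the exact trigger in this notebook is not visible from the cells shown. One common way to avoid it is to take an explicit `.copy()` of the derived frame before assigning to it. A minimal sketch on made-up rows, assuming the frame comes from `drop_duplicates()`:

```python
import numpy as np
import pandas as pd

# Illustrative data only; the last two rows are exact duplicates.
raw = pd.DataFrame({
    "Bedrooms": [4.0, np.nan, 5.0, 5.0, 5.0],
    "Bathrooms": [3.0, 4.0, np.nan, 4.0, 4.0],
})

# .copy() makes the result independent of `raw`, so the assignments
# below modify `cleaned` unambiguously.
cleaned = raw.drop_duplicates().copy()

for col in ["Bedrooms", "Bathrooms"]:
    # median() skips NaN by default, so it is computed from the observed values.
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())

print(cleaned)
```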


In [None]:
# Price: drop the 2 rows where the price is missing
df = df.dropna(subset=['Price'])


In [12]:
# Property Type: fill any missing values with 'house' (lowercase, matching the existing categories)
df['Property Type'] = df['Property Type'].fillna('house')

print("\nAfter Handling Missing values:")
print(df.isna().sum())


After Handling Missing values:
Location           0
Property Type      0
Bedrooms           0
Bathrooms          0
Size             326
Price              0
Source_URL         0
dtype: int64
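
With 326 missing `Size` values, looking at the share of missing values per column (rather than the raw count) can help decide whether a feature is worth imputing. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "Size": ["485 m²", np.nan, np.nan, np.nan],
    "Price": [85_000_000.0, 60_000_000.0, 95_000_000.0, 78_000_000.0],
})

# isna().mean() gives the fraction of missing values in each column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```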




In [13]:
# Clean locations: remove quotes, strip whitespace, standardize common names
# Remove stray double-quote characters, then trim leading and trailing whitespace with .str.strip()

df['Location'] = df['Location'].str.replace('"', '', regex=False).str.strip()

# The Location column carries a redundant ", Nairobi" suffix; remove it so that
# "Runda, Westlands, Nairobi" → "Runda, Westlands"
df['Location'] = df['Location'].str.replace(', Nairobi', '', regex=False)

# Convert every location to title case (first letter of each word capitalized)
df['Location'] = df['Location'].str.title()

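
The location cleaning steps in this cell can be checked on a few hand-written strings. A minimal sketch, assuming the stray characters are double quotes and the suffix is literally `, Nairobi`; the sample strings are made up:

```python
import pandas as pd

locations = pd.Series([
    '"Runda, Westlands, Nairobi"',
    "  karen hardy ",
    "Lavington, Nairobi",
])

cleaned = (
    locations
    .str.replace('"', "", regex=False)           # drop literal double quotes
    .str.strip()                                 # trim surrounding whitespace
    .str.replace(", Nairobi", "", regex=False)   # drop the redundant city suffix
    .str.title()                                 # capitalize the first letter of each word
)
print(cleaned.tolist())
```

Passing `regex=False` makes both replacements literal string matches, so no character in the pattern is interpreted as a regular expression.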


In [16]:
# Standardize location names for easier grouping
df['Location'] = df['Location'].replace({
    # Westlands cluster variations
    'Runda, Westlands': 'Runda',
    'Lower Kabete, Westlands': 'Lower Kabete',
    'Westlands Area, Westlands': 'Westlands',
    'Westlands Area': 'Westlands',
    'Brookside, Westlands': 'Brookside',
    'Nyari, Westlands': 'Nyari',
    'Kyuna, Westlands': 'Kyuna',
    'Spring Valley, Westlands': 'Spring Valley',
    'Rosslyn, Westlands': 'Rosslyn',
    'Loresho, Westlands': 'Loresho',
    'Parklands, Westlands': 'Parklands',
    'Riverside, Westlands': 'Riverside',

    # Karen cluster
    'Karen Hardy': 'Karen',
    'Karen, Nairobi': 'Karen',

    # Other frequent areas
    'Kitisuru, Westlands': 'Kitisuru',
    'Garden Estate, Roysambu': 'Garden Estate',
    'Waiyaki Way, Westlands': 'Waiyaki Way',

    # Already canonical: these identity entries are no-ops, kept to document the expected names
    'Muthaiga': 'Muthaiga',
    'Kilimani': 'Kilimani',
    'Kileleshwa': 'Kileleshwa',
    'Dagoretti Corner': 'Dagoretti Corner',
})

print("\nUnique locations after standardize:", df['Location'].nunique())
print("\nTop 20 locations:")
print(df['Location'].value_counts().head(20))




Unique locations after standardize: 31

Top 20 locations:
Location
Lavington             168
Runda                  43
Karen                  38
Loresho                17
Kileleshwa             16
Kiambu Road            15
Westlands              10
Brookside              10
Nyari                   9
Lower Kabete            9
Kitisuru                7
Kilimani                6
Muthaiga                6
Riverside               6
Rosslyn                 4
Kyuna                   4
Waiyaki Way             4
Spring Valley           4
Thigiri, Westlands      3
Buruburu                2
Name: count, dtype: int64
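
The output still contains `Thigiri, Westlands`, which the mapping does not cover. One possible alternative to listing every variant by hand is to strip a trailing `, Westlands` qualifier with a regular expression. This is a sketch of that idea rather than the notebook's approach, and it assumes the part before the comma is always the neighbourhood name; the sample values are taken from the location names shown above.

```python
import pandas as pd

locations = pd.Series(["Thigiri, Westlands", "Runda, Westlands", "Westlands", "Kiambu Road"])

# Remove a trailing ", Westlands" only when some text precedes it,
# so the plain "Westlands" entry is left untouched.
standardized = locations.str.replace(r"^(.+?),\s*Westlands$", r"\1", regex=True)
print(standardized.tolist())
```

Note that this rule would turn `Westlands Area, Westlands` into `Westlands Area` rather than `Westlands`, so a small explicit mapping may still be needed for such cases.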


