# FairPrice Check: Real Estate Price Predictor and Anomaly Detector

# Business Understanding

## Business Problem

The real estate market is characterized by significant information asymmetry. Buyers and renters often lack the expertise and market visibility needed to determine whether a listed property price is fair relative to similar properties in the same location. At the same time, sellers and agents may unintentionally overprice or underprice properties due to reliance on intuition, incomplete comparable listings, or outdated market information.

These pricing inefficiencies can result in prolonged time on the market, failed negotiations, financial losses, and reduced trust in the housing ecosystem. The problem is especially pronounced in urban housing markets, where property prices vary widely based on location, size, property type, and available amenities, making price fairness difficult to assess for non-expert participants.

What is needed in practice is not an exact market valuation but a **decision-support system** that indicates whether a listed price is fair relative to comparable properties.


## Proposed Solution

This project proposes a **supervised machine learning classification system** that learns pricing patterns from historical real estate data and property features such as:

- Location
- Property type  
- Bedrooms, bathrooms, and toilets  
- Amenities and listing characteristics  
- Listing category (for rent / for sale)  

The model will classify each property into one of three pricing categories:

- **Underpriced** – Listed significantly below comparable market listings  
- **Fairly Priced** – Listed within a reasonable range of comparable market listings  
- **Overpriced** – Listed significantly above comparable market listings  

Instead of generating a numeric price estimate, the system focuses on **relative price fairness**, making it more interpretable and directly actionable for users. This approach provides an objective, data-driven benchmark that reduces reliance on subjective judgment and improves pricing transparency.
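To make the three categories concrete, the sketch below shows one way such labels could be derived. It assumes that "comparable listings" means listings sharing the same grouping columns, that the comparison point is the group median, and that "significantly" means more than 20% away from it. The function name `label_price_fairness`, the 20% tolerance, and the toy data are illustrative assumptions, not the labeling rule used later in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical tolerance: within +/-20% of the comparable median counts as fair.
TOLERANCE = 0.20


def label_price_fairness(listings, group_cols, tolerance=TOLERANCE):
    """Label each listing by comparing its price with the median of its group."""
    # Median price of the comparable group, broadcast back to every row
    group_median = listings.groupby(group_cols)["price"].transform("median")
    ratio = listings["price"] / group_median
    labels = np.select(
        [ratio < 1 - tolerance, ratio > 1 + tolerance],
        ["Underpriced", "Overpriced"],
        default="Fairly Priced",
    )
    return pd.Series(labels, index=listings.index, name="price_label")


# Toy data (made up for illustration): five two-bedroom rentals in one locality
toy = pd.DataFrame({
    "locality": ["Kilimani"] * 5,
    "bedrooms": [2] * 5,
    "price": [40_000, 60_000, 60_000, 65_000, 120_000],
})
toy["price_label"] = label_price_fairness(toy, ["locality", "bedrooms"])
print(toy)
```

With a group median of 60,000, the 40,000 listing falls below the lower bound and the 120,000 listing above the upper bound, while the rest are labeled fairly priced. A wider or narrower tolerance changes the class balance, so the threshold would need to be chosen with the market in mind.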


## Business Objectives

1. **Classify property listings as underpriced, fairly priced, or overpriced**  
  using historical real estate data and supervised machine learning.

2. **Support better pricing decisions for buyers, renters, and sellers**  
  by providing a clear and interpretable price fairness label.

3. **Improve market transparency in real estate pricing**  
  by reducing reliance on subjective pricing judgments.

4. **Enable scalable pricing analysis across cities and property types**  
  through a reusable and retrainable machine learning pipeline.


## Success Criteria

1. **Achieve strong classification performance**  
  - Overall accuracy of at least **75%**, ideally **80%** or higher  
  - Macro F1-score ≥ **0.75**  
  - Precision and Recall ≥ **0.70** for *underpriced* and *overpriced* classes  

2. **Provide interpretable model outputs**  
  - Feature importance or explanations available  
  - Clear, human-readable pricing labels
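
The quantitative criteria above can be checked mechanically once predictions exist. The sketch below is a minimal example using scikit-learn metrics; the helper name `meets_success_criteria`, the choice of 0.75 as the accuracy floor, and the toy labels are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

LABELS = ["Underpriced", "Fairly Priced", "Overpriced"]


def meets_success_criteria(y_true, y_pred):
    """Return a dict of pass/fail flags for the three quantitative criteria."""
    accuracy = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, labels=LABELS, average="macro", zero_division=0)
    # average=None returns one score per class, in the order given by LABELS
    precision = precision_score(y_true, y_pred, labels=LABELS, average=None, zero_division=0)
    recall = recall_score(y_true, y_pred, labels=LABELS, average=None, zero_division=0)
    per_class = dict(zip(LABELS, zip(precision, recall)))
    return {
        "accuracy": bool(accuracy >= 0.75),
        "macro_f1": bool(macro_f1 >= 0.75),
        "minority_precision_recall": all(
            min(per_class[label]) >= 0.70 for label in ("Underpriced", "Overpriced")
        ),
    }


# Toy labels (made up): a perfect prediction and a trivial "always fair" prediction
y_true = ["Underpriced", "Fairly Priced", "Overpriced", "Fairly Priced"]
perfect = meets_success_criteria(y_true, y_true)
always_fair = meets_success_criteria(y_true, ["Fairly Priced"] * 4)
print(perfect)
print(always_fair)
```

The "always fair" baseline fails every check, which is why macro F1 and per-class precision and recall are tracked alongside accuracy.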


## Business Value

### Buyers & Renters  
- Identify good deals and avoid overpriced listings  
- Improve negotiation leverage  
- Reduce financial risk and search time  

### Sellers & Property Owners  
- Set competitive prices based on market evidence  
- Reduce prolonged time on market  
- Improve chances of successful transactions  

### Real Estate Agents & PropTech Platforms  
- Add an objective, data-backed pricing indicator to listings  
- Enhance user trust and platform credibility  
- Differentiate platforms with intelligent pricing insights  

### Policy Makers & Urban Planning Analysts (Future Use)  
- Analyze housing affordability trends  
- Identify spatial price distortions  
- Support evidence-based housing policy development  


## Methodology Justification

This project intentionally reframes the pricing task from a regression problem (predicting an exact market price) into a multi-class classification problem (underpriced, fairly priced, overpriced). Numeric price predictions can be sensitive to noise, outliers, and incomplete feature coverage, especially in heterogeneous housing markets, and stakeholders typically care more about whether a listing is reasonably priced relative to comparable properties than about a precise price estimate. A classification-based approach provides more interpretable, actionable outputs for non-technical users and aligns more directly with real-world decision-making, such as identifying good deals or avoiding overpriced listings. Additionally, class-based labeling tends to be more robust to market volatility and data quality issues, making the system more stable and practical for deployment in real-world real estate platforms.

# Data Understanding

## Data Source

The dataset used in this project was sourced from **Kenya Property Centre**, a major online real estate listing platform in Kenya. It contains property listings for both rental and sale markets across different regions, with a strong focus on urban and peri-urban areas.

Each row in the dataset represents a single property listing.

In [1]:
# Import libraries
%matplotlib inline

import os

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
)
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

from xgboost import XGBClassifier


In [2]:
#import dataset
df = pd.read_csv("data/raw/kenya_listings.csv")
df.head()


Unnamed: 0,id,price,price_qualifier,bedrooms,bathrooms,toilets,furnished,serviced,shared,parking,category,type,sub_type,state,locality,sub_locality,listdate
0,1,5000.0,per month,3,0,,0,0,0,0,For Rent,Apartment,,Nairobi,Embakasi,Tassia,2020-07-18 00:00:00
1,2,12500000.0,,4,3,,0,0,0,0,For Sale,House,Detached Duplex,Kajiado,Kitengela,,2020-07-18 00:00:00
2,3,19500000.0,,5,5,,0,0,0,0,For Sale,House,Detached Duplex,Kajiado,Kitengela,,2020-07-18 00:00:00
3,4,19500000.0,,4,0,,0,0,0,4,For Sale,House,Detached Duplex,Kajiado,Kitengela,,2020-07-18 00:00:00
4,5,19500000.0,,5,5,,0,0,0,0,For Sale,House,Townhouse,Kajiado,Kitengela,,2020-07-18 00:00:00


In [3]:
df.shape

(16117, 17)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16117 entries, 0 to 16116
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               16117 non-null  int64  
 1   price            16114 non-null  float64
 2   price_qualifier  7247 non-null   object 
 3   bedrooms         16117 non-null  int64  
 4   bathrooms        16117 non-null  int64  
 5   toilets          6089 non-null   float64
 6   furnished        16117 non-null  int64  
 7   serviced         16117 non-null  int64  
 8   shared           16117 non-null  int64  
 9   parking          16117 non-null  int64  
 10  category         16117 non-null  object 
 11  type             16117 non-null  object 
 12  sub_type         6300 non-null   object 
 13  state            16117 non-null  object 
 14  locality         16111 non-null  object 
 15  sub_locality     1046 non-null   object 
 16  listdate         16117 non-null  object 
dtypes: float64(2), int64(7), object(8)

In [6]:
# Work on a copy so the raw DataFrame stays unchanged
df1 = df.copy()
df1.shape

(16117, 17)

## Dataset Structure

The dataset consists of the following key variables:

### Price Information
- **price**: The listed price of the property. For rental listings, this typically represents the monthly rent, while for sale listings it represents the asking sale price.
- **price_qualifier**: Additional context for the price (e.g., “per month”), primarily relevant for rental listings.

### Property Characteristics
- **bedrooms**: Number of bedrooms in the property.
- **bathrooms**: Number of bathrooms.
- **toilets**: Number of toilets.
- **parking**: Number of available parking spaces.
- **furnished**: Binary indicator showing whether the property is furnished.
- **serviced**: Indicates whether the property is serviced.
- **shared**: Indicates whether the property is shared with other occupants.

### Property Classification
- **category**: Indicates whether the property is *For Sale* or *For Rent*.
- **type**: Broad property type (e.g., House, Apartment).
- **sub_type**: More detailed property classification (e.g., Townhouse, Detached Duplex).

### Location Information
- **state**: County or major administrative region (e.g., Nairobi, Kajiado).
- **locality**: Town or area within the county (e.g., Embakasi, Kitengela).
- **sub_locality**: More granular neighborhood information where available.

### Listing Metadata
- **id**: Unique identifier for each listing.
- **listdate**: Date the property was listed on the platform.


## Relevance to the Business Problem

The dataset captures key determinants of property pricing in the Kenyan real estate market, particularly location, property size, and property type. Similar properties within the same locality often display significant price variation, indicating potential pricing inefficiencies.

This makes the dataset well-suited for developing a **pricing fairness classification model** that flags underpriced and overpriced listings to support better market transparency and decision-making.
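
One way to quantify the price variation mentioned above is to group listings that look alike and compare the highest and lowest price in each group. The sketch below uses made-up toy data and an assumed definition of "similar" (same category, locality, and bedroom count); the `max_to_min` ratio is just one possible spread measure.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "category": ["For Rent"] * 6,
    "locality": ["Kilimani"] * 3 + ["Karen"] * 3,
    "bedrooms": [2, 2, 2, 4, 4, 4],
    "price": [60_000, 80_000, 100_000, 200_000, 210_000, 220_000],
})

# Summarise prices within each group of comparable listings
spread = (
    toy.groupby(["category", "locality", "bedrooms"])["price"]
       .agg(n="count", median="median", minimum="min", maximum="max")
)
# Ratio of the most to the least expensive listing in the group
spread["max_to_min"] = spread["maximum"] / spread["minimum"]
print(spread.sort_values("max_to_min", ascending=False))
```

Groups with a high ratio are the ones where a fairness label is most informative. On real data, very small groups would need to be treated with care.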

# 1. Handling Missing Numerical Data

Before performing any analysis, it is crucial to identify missing values in the dataset, especially for numerical features that will be used in modeling.

In this step, we:

1. **Count missing values** for each column using `isna().sum()`.
2. **Calculate the percentage of missing values** relative to the total number of records.
3. **Combine the counts and percentages** into a single DataFrame for easier inspection.

This allows us to flag columns with significant missing data and decide whether to fill them or drop them.

In [7]:
# Count missing values per column and express them as a percentage of all rows
count = df1.isna().sum()
percent = count / df1.shape[0] * 100
null = pd.concat([count, percent], keys=['Missing values', '% Missing values'], axis=1)
null

Unnamed: 0,Missing values,% Missing values
id,0,0.0
price,3,0.018614
price_qualifier,8870,55.035056
bedrooms,0,0.0
bathrooms,0,0.0
toilets,10028,62.220016
furnished,0,0.0
serviced,0,0.0
shared,0,0.0
parking,0,0.0


### 1a) Handling Missing Values in the `toilets` Column

The toilets feature has a significant number of missing values (~62%). This column is important because the number of toilets is a key indicator of property size and amenities, which directly affects pricing.

To handle the missing data, we fill the missing values with 1 for bedsitters and with the **median** for the rest.

- **Why median?**  
  - The median is robust to outliers, which is important because some properties may have an unusually high number of toilets.  
  - Using the mean could skew the data if there are extreme values (e.g., luxury properties with many toilets).  

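This approach keeps the imputed entries plausible and ensures that the column contains no missing values. Note, however, that filling about 62% of a column with a single value compresses its spread, so the `toilets` feature should be interpreted with caution downstream.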
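
The difference between the two statistics is easy to see on a small example. The values below are made up for illustration: one luxury listing with 15 toilets pulls the mean up to 4, while the median stays at 2.

```python
import numpy as np
import pandas as pd

# Toy toilet counts (made up): one extreme value and two missing entries
toilets = pd.Series([1, 2, 2, 2, 3, 3, 15, np.nan, np.nan])

mean_value = toilets.mean()      # NaN values are skipped by default
median_value = toilets.median()
print(mean_value, median_value)

# Fill the gaps with the median, as done for the real column
filled = toilets.fillna(median_value)
print(filled.tolist())
```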

In [8]:
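# Bedsitters are single-room units, so fill their missing toilet counts with 1;
# fill the remaining missing values with the column median

df1.loc[df1['toilets'].isna() & (df1['sub_type'] == 'Bedsitter (Single Room)'), 'toilets'] = 1
df1['toilets'] = df1['toilets'].fillna(df1['toilets'].median())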

### 1b) Handling Missing Values in the `price` Column

The price column is our **target variable**, representing the listed price of each property. There are only 3 missing values in this column (~0.02%), which is a very small fraction of the dataset.

Because price is the variable we are trying to predict, **we do not impute these missing values**, as that could bias the model. Instead, we drop the rows with missing prices to ensure the model only learns from valid data.

This preserves data integrity and supports reliable detection of pricing anomalies.

In [9]:
# Remove all rows where the price is missing

df1 = df1.dropna(subset=['price'])

# 2. Dropping duplicates

In [10]:
#removing duplicates in the dataset

df1=df1.drop_duplicates()
df1.shape

(16114, 17)
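
The shape is unchanged by this step. Since `id` is a unique identifier for each listing, two rows can never match on every column, so `drop_duplicates()` over all columns has nothing to remove. A possible follow-up check, sketched below on made-up toy data, is to look for rows that repeat once `id` is ignored. Whether such rows are true duplicates or separate units in the same building is a judgment call, so this sketch only counts them.

```python
import pandas as pd

# Toy data (made up): listings 1 and 2 differ only in their id
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [50_000.0, 50_000.0, 75_000.0],
    "bedrooms": [2, 2, 3],
    "locality": ["Kilimani", "Kilimani", "Karen"],
})

# With the unique id included, no row is an exact duplicate
rows_after_full_dedup = len(toy.drop_duplicates())

# Ignoring id, the second listing repeats the first
content_cols = [col for col in toy.columns if col != "id"]
repeated = toy.duplicated(subset=content_cols, keep="first")
print(rows_after_full_dedup, int(repeated.sum()))
```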

# 3. Fixing Structural Errors

Structural errors are inconsistencies in categorical columns that could confuse the model. These include typos, rare categories, and inconsistent naming conventions. 

In [11]:
df1.price_qualifier.value_counts()

price_qualifier
per month                       6227
per calendar month               373
per plot                         228
per acre                         128
per square foot / per month      112
per day                           62
per square meter / per month      55
per square foot / per week        46
per hectare                        6
per square meter / per week        5
per square foot / per annum        3
per square foot                    1
per square meter                   1
Name: count, dtype: int64

In [12]:
df1.type.value_counts()

type
House                   6024
Apartment               5840
Land                    2668
Commercial Property     1580
Event Centre / Venue       2
Name: count, dtype: int64

In [13]:
df1.sub_type.value_counts()

sub_type
Townhouse                               1563
Office Space                             872
Residential Land                         810
Detached Duplex                          630
Detached Bungalow                        579
Mixed-use Land                           376
Semi-detached Bungalow                   286
Mini Flat                                244
Commercial Land                          205
Bedsitter (Single Room)                  191
Warehouse                                186
Semi-detached Duplex                      74
Terraced Duplex                           55
Shop                                      45
Plaza / Complex / Mall                    45
Block of Flats                            30
Terraced Bungalow                         29
Hotel / Guest House                       27
Restaurant / Bar                          24
Industrial Land                           14
School                                     5
Factory                                    3
F

In [14]:
df1.state.value_counts()

state
Nairobi          8934
Kiambu           2694
Kajiado          1293
Mombasa          1124
Machakos          534
Kilifi            480
Nakuru            290
Kisumu            150
Laikipia           97
Kwale              78
Embu               44
Uasin Gishu        43
Meru               42
Nyeri              39
Nandi              38
Makueni            30
Baringo            28
Muranga            27
Bungoma            21
Kirinyaga          17
Kitui              13
Nyandarua          12
Kericho            11
Kakamega           10
Trans Nzoia        10
Kisii               9
Isiolo              7
Lamu                6
Busia               6
Narok               4
Tharaka-Nithi       3
Bomet               3
Homa Bay            3
Garissa             2
Migori              2
Siaya               2
Samburu             2
Vihiga              2
Taita Taveta        1
West Pokot          1
Marsabit            1
Turkana             1
Name: count, dtype: int64

In [15]:
df1.locality.value_counts()

locality
Westlands          2446
Kilimani           1199
Kikuyu             1064
Lavington           891
Karen               749
                   ... 
Silale                1
Ting'Ang'A            1
Migwani               1
Kadzandani            1
Lodwar Township       1
Name: count, Length: 357, dtype: int64

In [16]:
df1.sub_locality.value_counts()

sub_locality
Runda              553
Loresho            107
South C             83
Industrial Area     80
Thigio              34
Muthaiga North      23
Old Muthaiga        22
Chiromo             20
South B             19
Rimpa               18
Tassia              13
Yukos               11
Imara Daima         10
Githurai 44         10
Rosslyn              9
Clay City            8
Githurai 45          5
Umoja Phase 1        5
Lucky Summer         4
New Muthaiga         3
Umoja Phase 2        3
Mukuru Village       1
Kariba               1
Kihingo              1
Kwa Njenga           1
Lindi                1
Kiembeni             1
Name: count, dtype: int64

In [17]:
df1.category.value_counts()

category
For Sale         9105
For Rent         6916
Short Let          84
Joint Venture       9
Name: count, dtype: int64

### 3a) Correcting Bedsitter Listings

Some apartments in the dataset have **0 bedrooms** and a missing `sub_type` (NaN).  

- In Kenya, such properties are typically **bedsitters (single-room units)**.  
- We update the `sub_type` for these listings to `'Bedsitter (Single Room)'`.  

This ensures that the dataset accurately reflects property characteristics, which is crucial for modeling property prices correctly and detecting anomalies.

In [18]:
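# Reclassify apartments with 0 bedrooms and a missing sub_type as Bedsitters (Single Room).
# sub_type is still NaN at this point (it is filled with 'Missing' in a later step), so test with isna().

bedsitter_mask = (df1['type'] == 'Apartment') & (df1['bedrooms'] == 0) & df1['sub_type'].isna()
df1.loc[bedsitter_mask, 'sub_type'] = 'Bedsitter (Single Room)'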

### 3b) Correcting Property Type for Subtypes

The `sub_type` column provides more detailed property classifications. Some properties are listed as `'Block of Flats'` but their broad `type` is marked as `'House'`.  

Since a block of flats is logically an **apartment**, we update the `type` column for all `'Block of Flats'` entries to `'Apartment'`.  

This correction ensures consistency between `sub_type` and `type`, improving the accuracy of downstream analysis and modeling.


In [19]:
# Update property type for all 'Block of Flats' listings to 'Apartment'

df1.loc[df1['sub_type'] == 'Block of Flats', 'type'] = 'Apartment'

### 4a) Cleaning and Standardizing the `price_qualifier` Column

The `price_qualifier` column provides context for how the listed price should be interpreted (e.g., per month, per day, per acre). To make this feature consistent and ready for modeling we will:

1. **Label all missing categorical values**  
   - Any missing values (NaN) in categorical columns are filled with the string `'Missing'` so that they can be handled consistently in later steps.

2. **Standardize similar categories**  
   - All `'per calendar month'` values are replaced with `'per month'` to reduce redundant categories.

3. **Assign meaningful values based on property category**  
   - For rental properties (`category == 'For Rent'`) with missing qualifiers, we assign `'per month'`.  
   - For sale properties (`category == 'For Sale'`) with missing qualifiers, we assign `'Sale'`.

This ensures that the `price_qualifier` column is clean, consistent, and aligned with the listing category, improving data quality for machine learning.

In [20]:
#per calendar month values are replaced with per month

df1.loc[(df1['price_qualifier']=='per calendar month'),'price_qualifier']='per month'

In [21]:
df1.loc[df1['price_qualifier'].isna(), 'category'].value_counts()

category
For Sale         8741
For Rent          117
Joint Venture       9
Name: count, dtype: int64

In [22]:
# Count missing values in each categorical (object) column

df1.select_dtypes(include=['object']).isnull().sum()

price_qualifier     8867
category               0
type                   0
sub_type            9814
state                  0
locality               6
sub_locality       15068
listdate               0
dtype: int64

In [23]:
# Fill all missing values in categorical (object) columns with the string 'Missing'
# so that they can be safely used in analysis and machine learning

for column in df1.select_dtypes(include=['object']).columns:
    if df1[column].isna().sum() > 0:
        df1.loc[:, column] = df1[column].fillna('Missing')

In [24]:
df1.select_dtypes(include=['object']).isnull().sum()

price_qualifier    0
category           0
type               0
sub_type           0
state              0
locality           0
sub_locality       0
listdate           0
dtype: int64

In [25]:
# For rental properties with missing price qualifiers, assign 'per month' to ensure consistency

df1.loc[(df1['category']=='For Rent')&(df1['price_qualifier']=='Missing'),'price_qualifier']='per month'

In [26]:
#Assigning the price qualifier 'Sale' to properties listed as 'For Sale' where the qualifier was missing

df1.loc[(df1['category']=='For Sale')&(df1['price_qualifier']=='Missing'),'price_qualifier']='Sale'

### 4b) Standardizing Property Category

Some properties in the dataset are listed as `'Short Let'`, which is essentially a **rental property**.  

To maintain consistency and simplify analysis, we update all `'Short Let'` entries in the `category` column to `'For Rent'`.  

This ensures that all rental properties are grouped under a single category, making feature engineering and modeling more straightforward.


In [27]:
#Reclassify all properties listed as 'Short Let' to 'For Rent' to ensure consistent rental category labeling

df1.loc[(df1['category']=='Short Let'),'category']='For Rent'

# 5. Dropping duplicates

In [28]:
#removing duplicates in the dataset

df1=df1.drop_duplicates()
df1.shape

(16114, 17)

# 6. Dropping unwanted/Irrelevant observations

### 6a) Removing Irrelevant Property Types

Not all properties in the dataset are relevant for our analysis. Since our focus is on **residential properties** (homes people can live in), we remove listings of the following types:

- **Event Centres / Venues**
- **Commercial property**
- **Land**

These property types are excluded because they are **not residential** and therefore not relevant for predicting rental or sale prices of homes like apartments or houses.  

By filtering out these listings, we ensure that the dataset is **focused, clean, and suitable for building an accurate pricing anomaly model**.

In [29]:
# Keep only residential properties (Houses and Apartments) and remove all other types

df1 = df1.loc[df1['type'].isin(['House', 'Apartment'])]
df1.shape

(11864, 17)

### 6b) Removing Joint Venture Properties

Some properties in the dataset are marked with a **'Joint Venture'** attribute in the `category` column.  

These properties are not standard residential listings and may have atypical pricing structures.  
To maintain consistency and improve model accuracy, we will remove all listings with the 'Joint Venture' category from the dataset.

In [31]:
# Remove properties categorized as 'Joint Venture' since they are not standard residential listings

df1=df1.loc[(df1['category']!='Joint Venture')]
df1.shape

(11862, 17)

### 6c) Inspecting Properties with 'per day' Price Qualifier

Some properties in the dataset have a `price_qualifier` of `'per day'`.  

These short-term rental listings often have atypical pricing compared to standard monthly or sale prices.  
Since our focus is on typical rental and sale properties, we identify these listings so they can be removed from the dataset.

In [32]:
# Check how many properties have 'per day' as price_qualifier before removing them

df1.loc[(df1['price_qualifier']=='per day')].shape

(61, 17)

In [33]:
# Remove all properties with a 'per day' price qualifier since they are short-term rentals,
# then confirm that no 'per day' listings remain in the dataset

df1 = df1.loc[df1['price_qualifier'] != 'per day']
df1.loc[df1['price_qualifier'] == 'per day'].shape

(0, 17)

### 6d) Removing Properties with Price of Zero

Some listings in the dataset have a `price` value of **zero**, which is unrealistic for both rental and sale properties.  

These entries could be errors, incomplete listings, or placeholders, and including them would distort our analysis and model predictions.  
We remove all properties with a price of zero to ensure the dataset contains only valid, meaningful pricing information.

In [34]:
# Check how many properties have a price of zero

df1.loc[df1['price'] == 0].shape

(59, 17)

In [35]:
# Remove all listings with price == 0,
# then confirm that no zero-priced properties remain in the dataset

df1 = df1.loc[df1['price'] != 0]
df1.loc[df1['price'] == 0].shape

(0, 17)

### 6e) Removing Properties with No Toilets and No Bathrooms

Some properties in the dataset have **both `toilets` and `bathrooms` equal to zero**, which is unrealistic for residential use.  

These listings are likely incomplete or erroneous.  
We remove all properties where **both toilets and bathrooms are zero**, while retaining properties that have at least one toilet or bathroom, as some may share facilities.

In [36]:
# Check how many properties have both toilets and bathrooms equal to zero

df1.loc[(df1['toilets'] == 0) & (df1['bathrooms'] == 0)].shape

(2, 17)

In [37]:
# Remove listings where both toilets and bathrooms are zero
# Confirm that no such properties remain in the dataset

df1=df1.loc[~((df1['toilets']==0)&(df1['bathrooms']==0))]
df1.loc[(df1['toilets']==0)&(df1['bathrooms']==0)].shape

(0, 17)

### 6f) Removing Properties with Missing Locality

Some properties in the dataset have no recorded `locality`; after the earlier fill step, these rows carry the placeholder value `'Missing'`.  

Since location is a **critical factor** for property pricing, listings without a specified locality cannot be used reliably in analysis or modeling.  
We remove all properties where `locality` is `'Missing'` so that the dataset contains only listings with a known location.

In [38]:
# Check how many properties have 'Missing' as their locality

df1.loc[(df1['locality']=='Missing')].shape

(4, 17)

In [39]:
# Remove all listings with missing locality
# Confirm that no properties with missing locality remain

df1=df1.loc[~(df1['locality']=='Missing')]
df1.loc[(df1['locality']=='Missing')].shape

(0, 17)

### 6g) Removing Inconsistent Bedsitter Listings

A bedsitter (single room) property should have **0 bedrooms**, since it is a single-room unit.  

Some listings in the dataset have `sub_type` marked as `'Bedsitter (Single Room)'` but have a non-zero number of bedrooms, which is inconsistent.  

We identify these inconsistent rows and remove them from the dataset to maintain data integrity, ensuring that property features accurately reflect their type.

In [40]:
# Check how many bedsitter listings have a non-zero number of bedrooms

df1.loc[(df1['sub_type']=='Bedsitter (Single Room)')&(df1['bedrooms']!=0)].shape


(5, 17)

In [41]:
# Remove bedsitter listings where bedrooms != 0 to ensure consistency
# Confirm that no inconsistent bedsitter listings remain

df1=df1.loc[~((df1['sub_type']=='Bedsitter (Single Room)')&(df1['bedrooms']!=0))]
df1.loc[(df1['sub_type']=='Bedsitter (Single Room)')&(df1['bedrooms']!=0)].shape

(0, 17)

### 6h) Removing House Listings with Zero Bedrooms

Some properties are listed as `type = 'House'` but have **zero bedrooms**, which is unrealistic for residential houses.  

We first inspect which `sub_type` values have this inconsistency and then remove all such rows from the dataset.  
This ensures that the `bedrooms` feature accurately reflects the type of property, improving the quality and reliability of the dataset for modeling.


In [42]:
# Check which house subtypes have zero bedrooms

df1.loc[(df1['bedrooms']==0)&(df1['type']=='House')].sub_type.value_counts()

sub_type
Missing              124
Townhouse             41
Detached Duplex       20
Detached Bungalow      8
Terraced Duplex        1
Name: count, dtype: int64

In [43]:
# Remove house listings where bedrooms == 0 to maintain realistic property features,
# then confirm that no houses with zero bedrooms remain in the dataset

df1 = df1.loc[~((df1['bedrooms'] == 0) & (df1['type'] == 'House'))]
df1.loc[(df1['bedrooms'] == 0) & (df1['type'] == 'House')].shape

(0, 17)
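
The consistency rules applied in steps 6d to 6h can also be gathered into a single check that is re-run whenever the data is refreshed. The sketch below is one possible way to do this; the helper name `count_rule_violations` and the toy rows are assumptions, and the rules simply mirror the filters used above.

```python
import pandas as pd


def count_rule_violations(listings):
    """Count the rows that break each consistency rule from steps 6d to 6h."""
    is_bedsitter = listings["sub_type"] == "Bedsitter (Single Room)"
    rules = {
        "zero_price": listings["price"] == 0,
        "no_toilet_or_bathroom": (listings["toilets"] == 0) & (listings["bathrooms"] == 0),
        "missing_locality": listings["locality"] == "Missing",
        "bedsitter_with_bedrooms": is_bedsitter & (listings["bedrooms"] != 0),
        "house_without_bedrooms": (listings["type"] == "House") & (listings["bedrooms"] == 0),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in rules.items()})


# Toy rows (made up) that together break every rule exactly once
toy = pd.DataFrame({
    "price": [0.0, 45_000.0, 30_000.0],
    "toilets": [1.0, 0.0, 1.0],
    "bathrooms": [1, 0, 1],
    "locality": ["Karen", "Missing", "Kilimani"],
    "sub_type": ["Townhouse", "Missing", "Bedsitter (Single Room)"],
    "type": ["House", "Apartment", "Apartment"],
    "bedrooms": [0, 2, 1],
})
violations = count_rule_violations(toy)
print(violations)
```

On the cleaned `df1`, every count would be expected to be zero if the filters above were applied as described.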

In [44]:
# Manually replace the rental price for the listing with id 554

df1.loc[df1['id'] == 554, 'price'] = 28000