# 🏠 Egypt Real Estate Listings

## Abstract  
##### This project focuses on analyzing and predicting real estate prices in Egypt using a dataset containing thousands of property listings. The dataset includes features such as location, area, number of rooms, bathrooms, and property type, along with price information. The goal of this project is to build a complete data analysis pipeline that automates data cleaning, preprocessing, visualization, and modeling using Python libraries like NumPy, Pandas, Matplotlib, and Seaborn. Insights from this dataset can help identify market trends, understand factors affecting housing prices, and support better real estate investment decisions.  

## Dataset Summary  
##### The dataset contains 19,924 rows and 11 columns covering property listings across different Egyptian cities and regions, with details about each property’s characteristics and pricing. It includes numeric attributes such as size and price, as well as categorical attributes like location, property type, and payment method. Key problems detected include missing values, non-numeric symbols within numeric fields (e.g., “1,200 EGP”, “250 sqm”), inconsistent text formats, possible duplicate listings, and outliers in property prices and sizes. These issues will be resolved through preprocessing before performing exploratory analysis and machine learning modeling to predict property prices.
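As a rough illustration of the "non-numeric symbols within numeric fields" problem, the sketch below strips currency and unit text from such strings. The helper names `parse_egp` and `parse_sqm` are hypothetical, and the `"... sqft / ... sqm"` layout is assumed from the sample rows rather than verified for the whole dataset.

```python
import re
from typing import Optional


def parse_egp(text: Optional[str]) -> Optional[float]:
    """Turn a string such as '1,200,000 EGP' into 1200000.0 (None if no number is found)."""
    if not isinstance(text, str):
        return None  # covers None and NaN coming from pandas
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def parse_sqm(text: Optional[str]) -> Optional[float]:
    """Extract the square-metre figure from '2,368 sqft / 220 sqm' (assumed format)."""
    if not isinstance(text, str):
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*sqm", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


print(parse_egp("1,200,000 EGP"))         # 1200000.0
print(parse_sqm("2,368 sqft / 220 sqm"))  # 220.0
```

Once the data is loaded, helpers like these could be applied column-wise, for example with `df["size"].map(parse_sqm)`.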


# Data Understanding & Profiling

## Step 1: Import the required packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport  # successor to the deprecated pandas_profiling package
%matplotlib inline


In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)

## Step 2: Load the dataset

In [3]:
# Load dataset
df = pd.read_csv("egypt_real_estate_listings.csv", encoding='utf-8')

# Display basic info about the dataset
print("✅ Dataset loaded successfully!")
print("Shape (rows, columns):", df.shape)
df.head()


✅ Dataset loaded successfully!
Shape (rows, columns): (19924, 11)


| | url | price | description | location | type | size | bedrooms | bathrooms | available_from | payment_method | down_payment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.propertyfinder.eg/en/plp/buy/chale... | 8000000 | OWN A CHALET IN EL GOUNA WITH A PRIME LOCATION... | Swan Lake Gouna, Al Gouna, Hurghada, Red Sea | Chalet | 732 sqft / 68 sqm | 1+ Maid | 1 | 31 Aug 2025 | Cash | 1,200,000 EGP |
| 1 | https://www.propertyfinder.eg/en/plp/buy/villa... | 25000000 | For sale, a villa with immediate delivery in C... | Karmell, New Zayed City, Sheikh Zayed City, Giza | Villa | 2,368 sqft / 220 sqm | 4 | 4 | 2 Sep 2025 | Cash | 2,100,000 EGP |
| 2 | https://www.propertyfinder.eg/en/plp/buy/chale... | 15135000 | With a down payment of EGP 1,513,000, a fully ... | Azha North, Ras Al Hekma, North Coast | Chalet | 1,270 sqft / 118 sqm | 2 | 2 | 19 Aug 2025 | Cash | 1,513,000 EGP |
| 3 | https://www.propertyfinder.eg/en/plp/buy/apart... | 12652000 | Own an apartment in New Cairo with a minimal d... | Taj City, 5th Settlement Compounds, The 5th Se... | Apartment | 1,787 sqft / 166 sqm | 3 | 2 | 26 Aug 2025 | Installments | 1,260,000 EGP |
| 4 | https://www.propertyfinder.eg/en/plp/buy/villa... | 45250000 | Project: Granville\nLocation: Fifth Settlement... | Granville, New Capital City, Cairo | Villa | 4,306 sqft / 400 sqm | 7 | 7 | 2 Sep 2025 | Cash | 2,262,500 EGP |
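The preview shows two text patterns that a later cleaning step would probably need to handle: `bedrooms` values such as `1+ Maid`, and comma-separated `location` strings. The sketch below is one possible way to split them. The function names are hypothetical, and the assumption that locations run from the most specific place to the broadest region is based only on the five rows above.

```python
import re


def parse_bedrooms(text):
    """Split a value such as '1+ Maid' into (bedroom_count, has_maid_room)."""
    if not isinstance(text, str):
        return None, False  # missing values carry no information
    lowered = text.strip().lower()
    has_maid = "maid" in lowered
    match = re.search(r"\d+", lowered)
    count = int(match.group()) if match else None  # labels without digits stay None
    return count, has_maid


def split_location(text):
    """Return (first part, last part) of a comma-separated location string."""
    if not isinstance(text, str):
        return None, None
    parts = [p.strip() for p in text.split(",") if p.strip()]
    if not parts:
        return None, None
    return parts[0], parts[-1]


print(parse_bedrooms("1+ Maid"))
print(split_location("Swan Lake Gouna, Al Gouna, Hurghada, Red Sea"))
```

Note that truncated strings (those ending in `...` in the preview) are only a display effect of `df.head()`; the full values in the DataFrame are what these helpers would receive.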


## Step 3: General Dataset Overview

In [4]:
# Display general information about the dataset (columns, data types, non-null counts)
df.info()

# Show statistical summary for both numeric and categorical columns
print("\nStatistical Summary (Numeric & Categorical):\n")
display(df.describe(include='all').T)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19924 entries, 0 to 19923
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   url             19924 non-null  object
 1   price           19385 non-null  object
 2   description     19846 non-null  object
 3   location        19833 non-null  object
 4   type            19847 non-null  object
 5   size            19847 non-null  object
 6   bedrooms        19780 non-null  object
 7   bathrooms       19784 non-null  object
 8   available_from  19261 non-null  object
 9   payment_method  19383 non-null  object
 10  down_payment    5445 non-null   object
dtypes: object(11)
memory usage: 1.7+ MB

Statistical Summary (Numeric & Categorical):



| | count | unique | top | freq |
|---|---|---|---|---|
| url | 19924 | 19924 | https://www.propertyfinder.eg/en/plp/buy/chale... | 1 |
| price | 19385 | 4286 | 10000000 | 307 |
| description | 19846 | 18130 | Please Note Before Reading:\nThis is one of ma... | 72 |
| location | 19833 | 1535 | Marassi, Sidi Abdel Rahman, North Coast | 433 |
| type | 19847 | 17 | Apartment | 8355 |
| size | 19847 | 683 | 1,507 sqft / 140 sqm | 432 |
| bedrooms | 19780 | 18 | 3 | 4959 |
| bathrooms | 19784 | 16 | 3 | 6562 |
| available_from | 19261 | 353 | 1 Sep 2025 | 3254 |
| payment_method | 19383 | 2 | Cash | 15521 |
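Because `df.info()` reports every column as `object`, numeric and date columns would need explicit conversion before any statistics are meaningful. The following is a minimal sketch on a tiny hand-made frame. The `"Ask for price"` entry is invented purely to show how coercion behaves, and the date format `%d %b %Y` is assumed from values like `31 Aug 2025`.

```python
import pandas as pd

# Tiny illustrative frame mimicking the raw text columns (not real rows from the dataset).
sample = pd.DataFrame({
    "price": ["8000000", "25000000", "Ask for price"],
    "available_from": ["31 Aug 2025", "2 Sep 2025", None],
})

# Non-numeric entries become NaN instead of raising an error.
sample["price_num"] = pd.to_numeric(sample["price"], errors="coerce")

# Values that do not match the assumed format become NaT.
sample["available_dt"] = pd.to_datetime(
    sample["available_from"], format="%d %b %Y", errors="coerce"
)

print(sample.dtypes)
```

Comparing `sample["price"].notna().sum()` with `sample["price_num"].notna().sum()` is a quick way to see how many entries were lost to coercion.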


## Step 4: Check for Duplicates, Empty Columns, and Missing Values Summary

In [5]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Check for completely empty columns (all values are NaN)
empty_columns = [col for col in df.columns if df[col].isna().all()]
print(f"Completely empty columns: {empty_columns}")

# Count missing values per column
missing_values = df.isna().sum().sort_values(ascending=False)
print("\nMissing Values per Column:\n", missing_values)

# Calculate percentage of missing values per column
missing_percentage = (df.isna().mean() * 100).round(2).sort_values(ascending=False)
print("\nPercentage of Missing Values:\n", missing_percentage)


Number of duplicate rows: 0
Completely empty columns: []

Missing Values per Column:
 down_payment      14479
available_from      663
payment_method      541
price               539
bedrooms            144
bathrooms           140
location             91
description          78
type                 77
size                 77
url                   0
dtype: int64

Percentage of Missing Values:
 down_payment      72.67
available_from     3.33
payment_method     2.72
price              2.71
bedrooms           0.72
bathrooms          0.70
location           0.46
description        0.39
type               0.39
size               0.39
url                0.00
dtype: float64
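With about 72.67% of `down_payment` missing, imputing that column is unlikely to be reliable. One option, sketched below, is to drop very sparse columns while keeping a 0/1 flag that records whether a value was present. The function name `drop_sparse_columns` and the 0.5 threshold are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd


def drop_sparse_columns(frame, threshold=0.5, keep_flag=True):
    """Drop columns whose missing share exceeds `threshold`; optionally keep a presence flag."""
    out = frame.copy()
    missing_share = out.isna().mean()
    for col in missing_share[missing_share > threshold].index:
        if keep_flag:
            # 1 where the original value was present, 0 where it was missing
            out[f"has_{col}"] = out[col].notna().astype(int)
        out = out.drop(columns=col)
    return out


# Hand-made example: 'down_payment' is 75% missing, 'price' is 25% missing.
demo = pd.DataFrame({
    "price": ["8000000", "25000000", None, "12652000"],
    "down_payment": ["1,200,000 EGP", None, None, None],
})
cleaned = drop_sparse_columns(demo)
print(cleaned.columns.tolist())  # ['price', 'has_down_payment']
```

Columns with only a few percent missing, such as `price` or `available_from`, would pass through unchanged and could be handled by row removal or imputation instead.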


## Step 5: Check Unique Values and Cardinality of Each Column

In [6]:
# Count the number of unique values in each column
unique_counts = df.nunique().sort_values(ascending=False)
print("Unique values per column:\n", unique_counts)

# Display columns with very low or very high uniqueness
high_cardinality = unique_counts[unique_counts > (0.9 * len(df))]
low_cardinality = unique_counts[unique_counts <= 5]

print("\nColumns with very high uniqueness (possible IDs or URLs):\n", high_cardinality)
print("\nColumns with very low uniqueness (might be categorical with few values):\n", low_cardinality)


Unique values per column:
 url               19924
description       18130
price              4286
down_payment       2030
location           1535
size                683
available_from      353
bedrooms             18
type                 17
bathrooms            16
payment_method        2
dtype: int64

Columns with very high uniqueness (possible IDs or URLs):
 url            19924
description    18130
dtype: int64

Columns with very low uniqueness (might be categorical with few values):
 payment_method    2
dtype: int64
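Since `url` is unique for every row, `df.duplicated()` over all columns cannot flag anything, even though `description` has fewer unique values (18,130) than non-null entries (19,846). A hedged way to look for repeated listings is to ignore identifier-like columns when comparing rows. The helper name `count_content_duplicates` is hypothetical, and whether such matches are true duplicates would still need manual inspection.

```python
import pandas as pd


def count_content_duplicates(frame, id_columns=("url",)):
    """Count rows that repeat an earlier row once identifier columns are ignored."""
    content_cols = [c for c in frame.columns if c not in id_columns]
    return int(frame.duplicated(subset=content_cols).sum())


# Hand-made example: rows 0 and 1 differ only in their URL.
demo = pd.DataFrame({
    "url": ["u1", "u2", "u3"],
    "price": ["8000000", "8000000", "25000000"],
    "location": ["A", "A", "B"],
})
print(int(demo.duplicated().sum()))    # 0
print(count_content_duplicates(demo))  # 1
```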


## Step 6: Generate an automated profiling report for deeper exploration

In [7]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="data Report", explorative=True)

profile.to_notebook_iframe()
