<a href="https://colab.research.google.com/github/AIsoroush/deep-learning-projects/blob/main/HousePriceipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Real Estate Price Prediction

**Author:** Soroush Taghados
\
**Project Type:** Regression / Machine Learning  

---

## Project Overview
This project predicts real estate prices using a dataset containing property features such as area, number of rooms, and location. A **Linear Regression** model is trained to estimate property prices accurately.

---

## Data Preprocessing
- Missing values removed.  
- Textual features (e.g., `Address`) encoded using **One-Hot Encoding**.  
- Numerical features scaled using **StandardScaler**.  
- Outliers filtered based on area per room and price per area.

---

## Model Pipeline
The project uses **scikit-learn Pipeline** to combine preprocessing and modeling, ensuring reproducibility and clean workflow.

---

## Model Evaluation
- **Metrics:** Mean Squared Error (MSE) and R² score.  
- Model achieved **excellent performance** on the test set, accurately predicting property prices.

---

## How to Use
1. Clone this repository.  
2. Ensure required libraries are installed (`tensorflow`, `pandas`, `numpy`, `scikit-learn`, etc.).  
3. Run `main.py` (or your Jupyter Notebook) to train and evaluate the model.  
4. Use the provided Google Drive link to download the dataset.

---

## Dataset
The dataset includes features such as:
- `Area` (numeric)  
- `Room` (numeric)  
- `Address` (categorical)  
- `Price` (target)

Dataset can be downloaded [here](https://drive.google.com/uc?id=1jsbjvaITnAPPJ7LAEmnfyHOEvORmoJfw).

---

## Results
The trained Linear Regression model predicts real estate prices with high accuracy.



In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
import seaborn as sns
%matplotlib inline

In [22]:
# -------------------------------
# Download dataset
# -------------------------------
import gdown
import os

os.makedirs("data", exist_ok=True)

file_id = "1o1uxkpKkGsDkjKoIxfQLcO-sBSP3tMXs"
url = f"https://drive.google.com/uc?id={file_id}"  # Direct download link

out_path = "data/drug_dataset.csv"

print("Downloading dataset...")
gdown.download(url, out_path, quiet=False)
print(f"✅ Dataset downloaded to {out_path}")

file = out_path


Downloading dataset...


Downloading...
From: https://drive.google.com/uc?id=1o1uxkpKkGsDkjKoIxfQLcO-sBSP3tMXs
To: /content/data/drug_dataset.csv
100%|██████████| 190k/190k [00:00<00:00, 6.55MB/s]

✅ Dataset downloaded to data/drug_dataset.csv





In [23]:
# -------------------------------
# Import dataset from CSV file
# -------------------------------
import pandas as pd

data = pd.read_csv(file)  # Load dataset into a pandas DataFrame

# Preview the first 5 rows to verify the data
data.head()

Unnamed: 0,Area,Room,Parking,Warehouse,Elevator,Address,Price,Price(USD)
0,63,1,True,True,True,Shahran,1850000000.0,61666.67
1,60,1,True,True,True,Shahran,1850000000.0,61666.67
2,79,2,True,True,True,Pardis,550000000.0,18333.33
3,95,2,True,True,True,Shahrake Qods,902500000.0,30083.33
4,123,2,True,True,True,Shahrake Gharb,7000000000.0,233333.33


In [24]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3479 entries, 0 to 3478
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Area        3479 non-null   object 
 1   Room        3479 non-null   int64  
 2   Parking     3479 non-null   bool   
 3   Warehouse   3479 non-null   bool   
 4   Elevator    3479 non-null   bool   
 5   Address     3456 non-null   object 
 6   Price       3479 non-null   float64
 7   Price(USD)  3479 non-null   float64
dtypes: bool(3), float64(2), int64(1), object(2)
memory usage: 146.2+ KB


In [25]:
# -------------------------------
# Remove any rows with missing values
# -------------------------------
data.dropna(inplace=True)


In [26]:
# -------------------------------
# Display a concise summary of the DataFrame
# Shows number of non-null entries, datatypes, and memory usage
# -------------------------------
data.info()


<class 'pandas.core.frame.DataFrame'>
Index: 3456 entries, 0 to 3478
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Area        3456 non-null   object 
 1   Room        3456 non-null   int64  
 2   Parking     3456 non-null   bool   
 3   Warehouse   3456 non-null   bool   
 4   Elevator    3456 non-null   bool   
 5   Address     3456 non-null   object 
 6   Price       3456 non-null   float64
 7   Price(USD)  3456 non-null   float64
dtypes: bool(3), float64(2), int64(1), object(2)
memory usage: 172.1+ KB


In [27]:
# -------------------------------
# Count the number of occurrences of each unique value in the 'Address' column
# Useful to see the distribution of addresses in the dataset
# -------------------------------
data.Address.value_counts()


Unnamed: 0_level_0,count
Address,Unnamed: 1_level_1
Punak,161
Pardis,146
West Ferdows Boulevard,145
Gheitarieh,141
Shahran,130
...,...
Firoozkooh,1
Shadabad,1
Naziabad,1
Javadiyeh,1


In [28]:
# -------------------------------
# Remove leading and trailing spaces from each value in the 'Address' column
# This ensures consistency before encoding or analysis
# -------------------------------
data['Address'] = data['Address'].apply(lambda x: x.strip())


In [29]:
# -------------------------------
# Count the number of occurrences of each unique 'Address'
# and sort them in descending order
# Gives a clear view of which addresses are most frequent
# -------------------------------
location_state = data.groupby('Address')['Address'].agg('count').sort_values(ascending=False)
location_state


Unnamed: 0_level_0,Address
Address,Unnamed: 1_level_1
Punak,161
Pardis,146
West Ferdows Boulevard,145
Gheitarieh,141
Shahran,130
...,...
Telecommunication,1
Villa,1
Varamin - Beheshti,1
Yakhchiabad,1


In [30]:
# -------------------------------
# Identify addresses that appear 10 times or fewer in the dataset
# These rare categories might be grouped together or treated differently
# -------------------------------
location_less_than_10 = location_state[location_state <= 10]
location_less_than_10

Unnamed: 0_level_0,Address
Address,Unnamed: 1_level_1
Ozgol,10
Gholhak,10
Air force,10
Zafar,10
Araj,9
...,...
Telecommunication,1
Villa,1
Varamin - Beheshti,1
Yakhchiabad,1


In [31]:
# -------------------------------
# Replace rare addresses (appearing <= 10 times) with 'other'
# This reduces sparsity and helps the model generalize better
# -------------------------------
data['Address'] = data['Address'].apply(lambda x: 'other' if x in location_less_than_10 else x)


In [32]:
# -------------------------------
# Display the updated counts of each unique value in 'Address'
# Useful to verify that rare addresses have been grouped into 'other'
# -------------------------------
data['Address'].value_counts()


Unnamed: 0_level_0,count
Address,Unnamed: 1_level_1
other,432
Punak,161
Pardis,146
West Ferdows Boulevard,145
Gheitarieh,141
...,...
Rudhen,11
West Pars,11
Qalandari,11
Qazvin Imamzadeh Hassan,11


In [33]:
# -------------------------------
# Display all unique values in the 'Area' column
# Helps to understand the different categories before encoding
# -------------------------------
data.Area.unique()


array(['63', '60', '79', '95', '123', '70', '87', '59', '54', '71', '68',
       '64', '136', '155', '140', '42', '93', '65', '99', '105', '160',
       '77', '110', '100', '90', '49', '96', '67', '62', '55', '129',
       '109', '58', '150', '130', '88', '51', '113', '98', '75', '61',
       '72', '122', '215', '101', '53', '74', '114', '151', '300', '76',
       '148', '40', '128', '94', '97', '137', '85', '78', '48', '82',
       '120', '139', '66', '80', '44', '50', '121', '141', '127', '180',
       '158', '144', '245', '190', '108', '117', '200', '125', '236',
       '220', '86', '84', '106', '320', '154', '210', '124', '83', '270',
       '104', '103', '165', '135', '132', '81', '153', '166', '175',
       '170', '115', '118', '116', '43', '230', '91', '126', '450', '500',
       '145', '112', '192', '164', '265', '92', '143', '350', '335',
       '235', '225', '221', '312', '188', '198', '650', '179', '256',
       '257', '167', '246', '168', '280', '69', '400', '660', '213', '

In [34]:
# -------------------------------
# Convert 'Area' column to numeric values
# 'errors="coerce"' will replace non-numeric entries with NaN
# Then drop rows with NaN to clean the data
# Finally, convert the column to integer type for consistency
# -------------------------------
data['Area'] = pd.to_numeric(data['Area'], errors='coerce')
data.dropna(inplace=True)
data['Area'] = data['Area'].astype(int)


In [35]:
#check again
data.Area.unique()

array([ 63,  60,  79,  95, 123,  70,  87,  59,  54,  71,  68,  64, 136,
       155, 140,  42,  93,  65,  99, 105, 160,  77, 110, 100,  90,  49,
        96,  67,  62,  55, 129, 109,  58, 150, 130,  88,  51, 113,  98,
        75,  61,  72, 122, 215, 101,  53,  74, 114, 151, 300,  76, 148,
        40, 128,  94,  97, 137,  85,  78,  48,  82, 120, 139,  66,  80,
        44,  50, 121, 141, 127, 180, 158, 144, 245, 190, 108, 117, 200,
       125, 236, 220,  86,  84, 106, 320, 154, 210, 124,  83, 270, 104,
       103, 165, 135, 132,  81, 153, 166, 175, 170, 115, 118, 116,  43,
       230,  91, 126, 450, 500, 145, 112, 192, 164, 265,  92, 143, 350,
       335, 235, 225, 221, 312, 188, 198, 650, 179, 256, 257, 167, 246,
       168, 280,  69, 400, 660, 213,  57, 102, 133,  73, 134, 191, 282,
        89, 111, 147, 157, 283, 863, 415, 173, 162, 156, 171, 261,  45,
       161,  46, 107, 420, 131, 185, 250, 216, 680, 750, 202, 138,  38,
        56, 197,  52, 365, 181, 146, 240, 142, 303, 203, 204, 25

In [36]:
# -------------------------------
# Display all unique values in the 'Room' column
# -------------------------------
data.Room.unique()


array([1, 2, 3, 0, 4, 5])

In [37]:
# -------------------------------
# Display the counts of each unique value in the 'Price' column
# Helps to understand the distribution of prices in the dataset
# -------------------------------
data.Price.value_counts()


Unnamed: 0_level_0,count
Price,Unnamed: 1_level_1
2.000000e+09,40
2.200000e+09,36
3.500000e+09,36
3.000000e+09,34
1.200000e+09,32
...,...
1.792000e+09,1
2.410000e+09,1
2.035000e+09,1
6.350000e+08,1


In [38]:
# -------------------------------
# Display the counts of each unique value in the 'Price(USD)' column
# Useful to understand the distribution of car prices in USD
# -------------------------------
data['Price(USD)'].value_counts()

Unnamed: 0_level_0,count
Price(USD),Unnamed: 1_level_1
66666.67,40
116666.67,36
73333.33,36
100000.00,34
40000.00,32
...,...
23516.67,1
661933.33,1
380000.00,1
1055600.00,1


In [39]:
# -------------------------------
# Generate descriptive statistics of the dataset
# Includes count, mean, standard deviation, min, max, and quartiles for numeric columns
# Helps to quickly understand data distribution and spot potential outliers
# -------------------------------
data.describe()

Unnamed: 0,Area,Room,Price,Price(USD)
count,3450.0,3450.0,3450.0,3450.0
mean,106.917391,2.081159,5375563000.0,179185.4
std,69.550976,0.760216,8125918000.0,270863.9
min,30.0,0.0,3600000.0,120.0
25%,69.0,2.0,1419250000.0,47308.33
50%,90.0,2.0,2900000000.0,96666.67
75%,120.0,2.0,6000000000.0,200000.0
max,929.0,5.0,92400000000.0,3080000.0


In [40]:
# -------------------------------
# Create a new feature 'area_per_room' by dividing 'Area' by 'Room'
# This gives a per-room area metric which can be useful for modeling
# -------------------------------
data['area_per_room'] = data['Area'] / data['Room']

In [41]:
data.head()

Unnamed: 0,Area,Room,Parking,Warehouse,Elevator,Address,Price,Price(USD),area_per_room
0,63,1,True,True,True,Shahran,1850000000.0,61666.67,63.0
1,60,1,True,True,True,Shahran,1850000000.0,61666.67,60.0
2,79,2,True,True,True,Pardis,550000000.0,18333.33,39.5
3,95,2,True,True,True,Shahrake Qods,902500000.0,30083.33,47.5
4,123,2,True,True,True,Shahrake Gharb,7000000000.0,233333.33,61.5


In [42]:
data.area_per_room.describe()

Unnamed: 0,area_per_room
count,3450.0
mean,inf
std,
min,11.6
25%,40.666667
50%,48.0
75%,57.0
max,inf


In [43]:
# -------------------------------
# Filter the dataset to keep only rows where 'area_per_room' is at least 30
# Helps to remove unrealistic or very small values that may skew the model
# -------------------------------
data2 = data[data['area_per_room'] >= 30]

In [44]:
data2.area_per_room.describe()

Unnamed: 0,area_per_room
count,3393.0
mean,inf
std,
min,30.0
25%,41.0
50%,48.333333
75%,57.0
max,inf


In [45]:
data2.head()

Unnamed: 0,Area,Room,Parking,Warehouse,Elevator,Address,Price,Price(USD),area_per_room
0,63,1,True,True,True,Shahran,1850000000.0,61666.67,63.0
1,60,1,True,True,True,Shahran,1850000000.0,61666.67,60.0
2,79,2,True,True,True,Pardis,550000000.0,18333.33,39.5
3,95,2,True,True,True,Shahrake Qods,902500000.0,30083.33,47.5
4,123,2,True,True,True,Shahrake Gharb,7000000000.0,233333.33,61.5


In [46]:
# -------------------------------
# Create a new feature 'price_per_area' by dividing 'Price' by 'Area'
# Round the result to 2 decimal places
# This gives a metric of price per unit area, useful for modeling
# -------------------------------
data2['price_per_area'] = round(data2['Price'] / data2['Area'], 2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data2['price_per_area'] = round(data2['Price'] / data2['Area'], 2)


In [47]:
data2.price_per_area.describe()

Unnamed: 0,price_per_area
count,3393.0
mean,41593780.0
std,31741310.0
min,22500.0
25%,20317460.0
50%,35000000.0
75%,55000000.0
max,416666700.0


In [48]:
# -------------------------------
# Filter the dataset to keep only rows where 'price_per_area' is at least 25,000,000
# This removes extremely low-valued entries that might skew the model
# -------------------------------
data3 = data2[data2['price_per_area'] >= 2.5e7]

In [49]:
# -------------------------------
# Find the row(s) with the maximum 'price_per_area'
# Useful to identify the most expensive per-unit-area entries in the dataset
# -------------------------------
data3[data3['price_per_area'] == data3['price_per_area'].max()]

Unnamed: 0,Area,Room,Parking,Warehouse,Elevator,Address,Price,Price(USD),area_per_room,price_per_area
3394,54,1,True,True,True,West Ferdows Boulevard,22500000000.0,750000.0,54.0,416666700.0


In [50]:
# -------------------------------
# Filter the dataset to remove extremely high 'price_per_area' values above 230,000,000
# This step eliminates extreme outliers that could negatively impact model training
# -------------------------------
data3 = data3[data3['price_per_area'] <= 2.3e8]

In [51]:
data3

Unnamed: 0,Area,Room,Parking,Warehouse,Elevator,Address,Price,Price(USD),area_per_room,price_per_area
0,63,1,True,True,True,Shahran,1.850000e+09,61666.67,63.000000,29365079.37
1,60,1,True,True,True,Shahran,1.850000e+09,61666.67,60.000000,30833333.33
4,123,2,True,True,True,Shahrake Gharb,7.000000e+09,233333.33,61.500000,56910569.11
5,70,2,True,True,False,North Program Organization,2.050000e+09,68333.33,35.000000,29285714.29
7,59,1,True,True,True,Shahran,2.150000e+09,71666.67,59.000000,36440677.97
...,...,...,...,...,...,...,...,...,...,...
3472,113,3,True,True,True,Ostad Moein,3.170000e+09,105666.67,37.666667,28053097.35
3473,63,1,True,True,False,Feiz Garden,1.890000e+09,63000.00,63.000000,30000000.00
3474,86,2,True,True,True,Southern Janatabad,3.500000e+09,116666.67,43.000000,40697674.42
3475,83,2,True,True,True,Niavaran,6.800000e+09,226666.67,41.500000,81927710.84


In [52]:
# -------------------------------
# Drop unnecessary or derived columns that are no longer needed
# - 'area_per_room' and 'price_per_area' were intermediate features for filtering
# - 'Elevator', 'Warehouse', 'Parking' might be irrelevant for the current model
# -------------------------------
data3.drop(columns=['area_per_room', 'price_per_area', 'Elevator', 'Warehouse', 'Parking'], axis=1, inplace=True)


In [53]:
data3

Unnamed: 0,Area,Room,Address,Price,Price(USD)
0,63,1,Shahran,1.850000e+09,61666.67
1,60,1,Shahran,1.850000e+09,61666.67
4,123,2,Shahrake Gharb,7.000000e+09,233333.33
5,70,2,North Program Organization,2.050000e+09,68333.33
7,59,1,Shahran,2.150000e+09,71666.67
...,...,...,...,...,...
3472,113,3,Ostad Moein,3.170000e+09,105666.67
3473,63,1,Feiz Garden,1.890000e+09,63000.00
3474,86,2,Southern Janatabad,3.500000e+09,116666.67
3475,83,2,Niavaran,6.800000e+09,226666.67


In [54]:
# -------------------------------
# Import essential libraries for preprocessing, model evaluation, and pipeline creation
# - OneHotEncoder: to convert categorical features into numeric vectors
# - StandardScaler: to scale numerical features for better model performance
# - train_test_split: to split dataset into training and testing sets
# - make_pipeline & make_column_transformer: to create preprocessing pipelines
# - r2_score: to evaluate regression model performance
# -------------------------------
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import r2_score

In [55]:
# -------------------------------
# Create a column transformer for preprocessing
# - StandardScaler for numerical features ('Area' and 'Room') to normalize their values
# - OneHotEncoder for categorical feature ('Address') to convert it into numeric format
# - remainder='passthrough' ensures that other columns are left unchanged
# -------------------------------
scaler = StandardScaler()

col_trans = make_column_transformer(
    (StandardScaler(), ['Area', 'Room']),
    (OneHotEncoder(handle_unknown='ignore', sparse_output=False), ['Address']),
    remainder='passthrough'
)


In [64]:
# -------------------------------
# Initialize a Linear Regression model
# - Using scikit-learn's LinearRegression for predicting continuous target variable
# -------------------------------
from sklearn import linear_model
lr = linear_model.LinearRegression()

In [57]:
# -------------------------------
# Create a machine learning pipeline
# - col_trans: preprocess numerical and categorical features
# - lr: Linear Regression model for prediction
# - Note: scaler is already included for numerical features in col_trans,
#         so no need to add an extra StandardScaler in the pipeline
# -------------------------------
# model = make_pipeline(col_trans, scaler, lr)  # alternative if scaler separate
model = make_pipeline(col_trans, lr)


In [58]:
# -------------------------------
# Define input features (X) and target variable (y)
# - data_input: all columns except 'Price' (features for prediction)
# - data_output: 'Price' column (target variable)
# -------------------------------
data_input = data3.drop(columns=['Price'])
data_output = data3['Price']


In [59]:
# -------------------------------
# Split the dataset into training and testing sets
# - test_size=0.2: 20% of data reserved for testing
# - x_train, y_train: training features and target
# - x_test, y_test: testing features and target
# -------------------------------
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_input, data_output, test_size=0.2, random_state=42)


In [60]:
# -------------------------------
# Fit the pipeline (preprocessing + Linear Regression model) on training data
# - x_train: input features for training
# - y_train: target values for training
# - The pipeline ensures that preprocessing (scaling, one-hot encoding) is applied automatically
# -------------------------------
model.fit(x_train, y_train)


The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [65]:
# -------------------------------
# Evaluate the performance of the trained model on the test set
# - x_test: input features for testing
# - y_test: true target values for testing
# - model.score() returns the R^2 score for regression models
# -------------------------------
score = model.score(x_test, y_test)
print(f"R^2 score on test data: {score:.2f}")

R^2 score on test data: 1.00


In [66]:
# -------------------------------
# Generate predictions on the test set using the trained model
# - x_test: input features for testing
# - prd: predicted target values (prices)
# -------------------------------
prd = model.predict(x_test)

In [68]:
# -------------------------------
# Evaluate the model's performance on the test set
# - MSE (Mean Squared Error): average squared difference between predicted and actual values
# - R2 score: proportion of variance in the dependent variable explained by the model
# -------------------------------
from sklearn.metrics import r2_score
import numpy as np

print('MSE: %.2f' % np.mean((prd - y_test) ** 2))
print('R2: %2f' % r2_score(y_test, prd))

MSE: 6124.46
R2: 1.000000
