# Real Estate Price Prediction

## 1. Introduction

This project applies machine learning techniques to a real estate dataset from Madrid, Spain, in order to predict the prices of future listings in the same market.

<img src="images/repp-intro.png" alt="Confusion matrix" width="500">


**Motivation and Impact**
- Provide data-driven insights for investors who are interested in buying or selling properties
- Assist real estate agencies with market analysis and strategy design
- Help homebuyers and sellers evaluate property prices so they can make lower-risk, better-informed decisions
- Uncover relationships between property characteristics that explain price fluctuations

**Goal**

The main goal is to derive data-driven insights from the dataset by combining visualization with a variety of machine learning techniques.

**Dataset**

We use the [Madrid real estate market](https://www.kaggle.com/datasets/mirbektoktogaraev/madrid-real-estate-market) dataset on [Kaggle](https://www.kaggle.com/).

**Workflow**

1. x
2. y
3. z

**Tools**

- Python 3.11
- pandas, NumPy, Matplotlib, Seaborn
- Scikit-learn
- Etc.

**Intended Outcome**

By the end of the project, we'll have found the ML model that best generalizes to unseen data, as well as insight into which features most strongly influence real estate pricing.



In [11]:
import numpy as np
import pandas as pd

In [12]:
df = pd.read_csv("dataset/houses_Madrid.csv")
df.shape
df.head().transpose()


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
| id | 21742 | 21741 | 21740 | 21739 | 21738 |
| title | Piso en venta en calle de Godella, 64 | Piso en venta en calle de la del Manojo de Rosas | Piso en venta en calle del Talco, 68 | Piso en venta en calle Pedro Jiménez | Piso en venta en carretera de Villaverde a Val... |
| subtitle | San Cristóbal, Madrid | Los Ángeles, Madrid | San Andrés, Madrid | San Andrés, Madrid | Los Rosales, Madrid |
| sq_mt_built | 64.0 | 70.0 | 94.0 | 64.0 | 108.0 |
| sq_mt_useful | 60.0 | NaN | 54.0 | NaN | 90.0 |
| n_rooms | 2 | 3 | 2 | 2 | 2 |
| n_bathrooms | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 |
| n_floors | NaN | NaN | NaN | NaN | NaN |
| sq_mt_allotment | NaN | NaN | NaN | NaN | NaN |


In [13]:
# Drop identifier, address, and other columns that are not used further.
cols_to_drop = [
    "id", "title", "latitude", "street_number", "street_name", "n_floors",
    "sq_mt_allotment", "longitude", "is_exact_address_hidden", "raw_address",
    "portal", "is_floor_under", "door", "operation", "rent_price_by_area",
    "is_rent_price_known", "is_buy_price_known", "are_pets_allowed",
    "is_furnished", "is_kitchen_equipped", "has_private_parking", "has_public_parking",
]
df = df.drop(columns=cols_to_drop)
# TODO: reconsider whether "is_parking_included_in_price" and "parking_price" should be kept in df

In [14]:
false_cols = []
true_cols = []

for col in df.columns:
    uniques = set(df[col].dropna().unique())  
    
    if uniques == {False}:
        false_cols.append(col)
    elif uniques == {True}:
        true_cols.append(col)

print(false_cols)
print(true_cols)

[]
['has_ac', 'has_fitted_wardrobes', 'has_garden', 'has_pool', 'has_terrace', 'has_balcony', 'has_storage_room', 'is_accessible', 'has_green_zones']


In [15]:
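# Fill missing values with False before casting: NaN is truthy, so casting first would turn every missing value into True.
df[true_cols] = df[true_cols].fillna(False).astype(bool)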
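
One detail worth checking in this step is the order of `fillna` and `astype(bool)`: NaN is truthy in Python, so casting before filling turns missing values into True. A minimal sketch on a toy column (the values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): True where the feature is listed, NaN otherwise.
has_pool = pd.Series([True, np.nan, True, np.nan], dtype="object")

# Casting first: NaN is truthy, so every missing value becomes True.
cast_first = has_pool.astype(bool)

# Filling first: missing values become False, then the cast is safe.
fill_first = has_pool.fillna(False).astype(bool)

print(cast_first.tolist())  # [True, True, True, True]
print(fill_first.tolist())  # [True, False, True, False]
```

Depending on the pandas version, `fillna` on an object column may emit a downcasting `FutureWarning`; the result is unaffected.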

In [16]:
# Improving the data's representation: normalize string booleans ("TRUE"/"FALSE") to real booleans,
# then encode every boolean column as 0/1 integers.

df = df.replace({"TRUE": True, "FALSE": False, "True": True, "False": False}).infer_objects(copy=False)
bool_columns = [col for col in df.columns if set(df[col].dropna().unique()) <= {True, False}]
# Note: astype(bool) maps any remaining NaN to True before the cast to the nullable Int8 dtype.
df[bool_columns] = df[bool_columns].astype(bool).astype('Int8')
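
Because `astype(bool)` maps NaN to True, any boolean column that still contains missing values is encoded as 1 for those rows. If missing values should stay missing instead, one option is to go through pandas' nullable `"boolean"` dtype. The sketch below compares both routes on a toy column; the values are hypothetical, and this is an alternative rather than what the cell above does:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one missing entry.
has_lift = pd.Series([True, False, np.nan], dtype="object")

# Plain bool cast: the missing entry is encoded as 1.
plain = has_lift.astype(bool).astype("Int8")

# Nullable route: "boolean" keeps the missing entry as <NA> through the cast.
nullable = has_lift.astype("boolean").astype("Int8")

print(plain.tolist())     # [1, 0, 1]
print(nullable.tolist())  # [1, 0, <NA>]
```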





In [17]:

print((df.isna().sum(axis=1) >= 9).sum())
df.isna().sum(axis=0)


0


Unnamed: 0                          0
subtitle                            0
sq_mt_built                       126
sq_mt_useful                    13514
n_rooms                             0
n_bathrooms                        16
floor                            2607
neighborhood_id                     0
rent_price                          0
buy_price                           0
buy_price_by_area                   0
house_type_id                     391
is_renewal_needed                   0
is_new_development                  0
built_year                      11742
has_central_heating                 0
has_individual_heating              0
has_ac                              0
has_fitted_wardrobes                0
has_lift                            0
is_exterior                         0
has_garden                          0
has_pool                            0
has_terrace                         0
has_balcony                         0
has_storage_room                    0
is_accessibl
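
Raw counts are easier to compare as shares of rows; for instance, the 13,514 missing `sq_mt_useful` values correspond to about 62% of the 21,742 rows. A minimal sketch of that view on a toy frame follows. The values are hypothetical, and the 0.5 threshold is an illustrative assumption, not a rule used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the listings table.
toy = pd.DataFrame({
    "sq_mt_built": [64.0, 70.0, np.nan, 64.0],
    "sq_mt_useful": [60.0, np.nan, np.nan, np.nan],
    "n_rooms": [2, 3, 2, 2],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns above an illustrative threshold (0.5 is an assumption).
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(mostly_missing)  # ['sq_mt_useful']
```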

In [18]:
mapping = {
    "HouseType 1: Pisos": "apartment",
    "HouseType 5: Áticos": "penthouse",
    "HouseType 4: Dúplex": "duplex",
    "HouseType 2: Casa o chalet": "house"
}

df["house_type_id"] = df["house_type_id"].replace(mapping)
df["house_type_id"].value_counts()

df["floor"] = df["floor"].replace({"Bajo": 0})
df["floor"] = pd.to_numeric(df["floor"], errors="coerce").astype("Int16")
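
With `errors="coerce"`, every floor label that is not numeric and not explicitly mapped becomes missing without any warning. The sketch below shows one way to list the labels that would be lost. `"Bajo"` comes from the cell above, while `"Entreplanta"` is a hypothetical label used only for illustration:

```python
import pandas as pd

# Toy floor labels; "Entreplanta" is a hypothetical non-numeric label.
floor = pd.Series(["3", "Bajo", "Entreplanta", None, "1"], dtype="object")

# Same conversion as in the cell above.
numeric = pd.to_numeric(floor.replace({"Bajo": 0}), errors="coerce").astype("Int16")

# Labels that were present in the raw data but could not be parsed as numbers.
lost_labels = floor[floor.notna() & numeric.isna()].unique().tolist()

print(numeric.tolist())  # [3, 0, <NA>, <NA>, 1]
print(lost_labels)       # ['Entreplanta']
```

Running the same check on the real column before overwriting it would show which labels may deserve their own entry in the mapping.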

In [19]:
print(df.shape)
print(df.describe())
df.head().transpose()

(21742, 36)
         Unnamed: 0   sq_mt_built  sq_mt_useful       n_rooms   n_bathrooms  \
count  21742.000000  21616.000000   8228.000000  21742.000000  21726.000000   
mean   10870.500000    146.920892    103.458192      3.005749      2.091687   
std     6276.519112    134.181865     88.259192      1.510497      1.406992   
min        0.000000     13.000000      1.000000      0.000000      1.000000   
25%     5435.250000     70.000000     59.000000      2.000000      1.000000   
50%    10870.500000    100.000000     79.000000      3.000000      2.000000   
75%    16305.750000    162.000000    113.000000      4.000000      2.000000   
max    21741.000000    999.000000    998.000000     24.000000     16.000000   

          floor    rent_price     buy_price  buy_price_by_area  \
count   18740.0  2.174200e+04  2.174200e+04       21742.000000   
mean   2.667236 -5.917031e+04  6.537356e+05        4020.523871   
std    2.038491  9.171162e+05  7.820821e+05        1908.418774   
min         

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
| subtitle | San Cristóbal, Madrid | Los Ángeles, Madrid | San Andrés, Madrid | San Andrés, Madrid | Los Rosales, Madrid |
| sq_mt_built | 64.0 | 70.0 | 94.0 | 64.0 | 108.0 |
| sq_mt_useful | 60.0 | NaN | 54.0 | NaN | 90.0 |
| n_rooms | 2 | 3 | 2 | 2 | 2 |
| n_bathrooms | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 |
| floor | 3 | 4 | 1 | 0 | 4 |
| neighborhood_id | Neighborhood 135: San Cristóbal (1308.89 €/m2)... | Neighborhood 132: Los Ángeles (1796.68 €/m2) -... | Neighborhood 134: San Andrés (1617.18 €/m2) - ... | Neighborhood 134: San Andrés (1617.18 €/m2) - ... | Neighborhood 133: Los Rosales (1827.79 €/m2) -... |
| rent_price | 471 | 666 | 722 | 583 | 1094 |
| buy_price | 85000 | 129900 | 144247 | 109900 | 260000 |
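
The `neighborhood_id` strings in the preview pack a numeric code, a name, and a price per square metre into one field. A minimal sketch of splitting them apart is shown below. It assumes the pattern visible in the truncated preview holds for every row and ignores whatever follows the closing parenthesis; the regular expression and the output column names are illustrative choices:

```python
import pandas as pd

# Values copied from the visible part of the preview; the truncated tail is omitted.
neighborhood_id = pd.Series([
    "Neighborhood 135: San Cristóbal (1308.89 €/m2)",
    "Neighborhood 132: Los Ángeles (1796.68 €/m2)",
])

# Named groups become the columns of the extracted frame.
pattern = r"Neighborhood (?P<code>\d+): (?P<name>.+?) \((?P<price_per_m2>[\d.]+) €/m2\)"
parts = neighborhood_id.str.extract(pattern)

# Convert the numeric pieces from strings to numbers.
parts["code"] = pd.to_numeric(parts["code"])
parts["price_per_m2"] = pd.to_numeric(parts["price_per_m2"])

print(parts)
```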


In [20]:
df.to_csv("dataset/houses_Madrid_modified.csv", index=False)
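
CSV files do not store dtypes, so the nullable `Int8` and `Int16` columns created above are inferred again when the file is read back. A minimal sketch of the round trip, using an in-memory buffer and a toy frame with hypothetical values:

```python
import io
import pandas as pd

# Toy frame (hypothetical values) using the same nullable dtypes as the notebook.
toy = pd.DataFrame({
    "floor": pd.array([3, None, 0], dtype="Int16"),
    "has_pool": pd.array([1, 0, 1], dtype="Int8"),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: dtypes are inferred, so the column with a gap becomes float64.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtypes restore the nullable integer columns.
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"floor": "Int16", "has_pool": "Int8"})

print(default_read.dtypes.astype(str).to_dict())
print(typed_read.dtypes.astype(str).to_dict())
```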