<a href="https://colab.research.google.com/github/HazemmoAlsady/-End-to-End-Predictive-Analytics-Project/blob/main/Forecasting_Property_Prices_Using_Regression_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Final Project: Real Estate Price Prediction**

Course: Forecast & Predictive Analytics

**Problem Statement**

The goal of this project is to build a predictive model to estimate real estate prices based on property characteristics. Accurate price prediction helps investors and decision-makers better understand market behavior.

In [6]:
import pandas as pd
import numpy as np


**Dataset Loading**

The dataset contains real estate sales data with multiple features related to property characteristics.
The target variable is **price.**

In [7]:
from google.colab import files


df = pd.read_csv('USA Housing Dataset.csv', sep=';')
df.head()


Unnamed: 0,"date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country"
0,"2014-05-09 00:00:00,376000.0,3.0,2.0,1340,1384..."
1,"2014-05-09 00:00:00,800000.0,4.0,3.25,3540,159..."
2,"2014-05-09 00:00:00,2238888.0,5.0,6.5,7270,130..."
3,"2014-05-09 00:00:00,324000.0,3.0,2.25,998,904,..."
4,"2014-05-10 00:00:00,549900.0,5.0,2.75,3060,701..."


In [13]:
df.head()

Unnamed: 0,"date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country"
0,"2014-05-09 00:00:00,376000.0,3.0,2.0,1340,1384..."
1,"2014-05-09 00:00:00,800000.0,4.0,3.25,3540,159..."
2,"2014-05-09 00:00:00,2238888.0,5.0,6.5,7270,130..."
3,"2014-05-09 00:00:00,324000.0,3.0,2.25,998,904,..."
4,"2014-05-10 00:00:00,549900.0,5.0,2.75,3060,701..."


**Dataset Overview**

We start by understanding the size, structure, and data types of the dataset.

In [8]:
df.shape
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4140 entries, 0 to 4139
Data columns (total 1 columns):
 #   Column                                                                                                                                                           Non-Null Count  Dtype 
---  ------                                                                                                                                                           --------------  ----- 
 0   date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country  4140 non-null   object
dtypes: object(1)
memory usage: 32.5+ KB


**Missing Values & Duplicate Check**

Ensuring data quality before modeling.

In [9]:
df.isnull().sum()
df.duplicated().sum()


np.int64(0)

**Descriptive Statistics**

This step helps understand data distribution and detect abnormal values.

In [10]:
df.describe()


Unnamed: 0,"date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country"
count,4140
unique,4140
top,"2014-07-10 00:00:00,220600.0,3.0,2.5,1490,8102..."
freq,1


**Data Cleaning**

Some records contain unrealistic price values (price = 0), which must be removed.

In [12]:
df = df[df['Price'] > 0]


KeyError: 'Price'

**Feature Reduction**

Textual columns that do not directly contribute to prediction are removed to simplify the model.

In [None]:
df = df.drop(columns=['street', 'country'])


**Feature Selection**

Textual columns are excluded to avoid unnecessary preprocessing steps.

In [None]:
X = df.drop(columns=['price', 'date', 'city', 'statezip'])
y = df['price']


**Train-Test Split**

The dataset is split into training and testing sets using a 70%-30% ratio.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


**Model 1: Linear Regression**

Linear Regression is used as a baseline model to capture linear relationships.

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)


**Model Evaluation (Linear Regression)**

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred_lr = lr_model.predict(X_test)

mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

mae_lr, rmse_lr, r2_lr


**Model 2: Decision Tree Regressor**

Decision Tree is used to capture non-linear relationships.

In [None]:
from sklearn.tree import DecisionTreeRegressor

dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)


**Model Evaluation (Decision Tree)**

In [None]:
y_pred_dt = dt_model.predict(X_test)

mae_dt = mean_absolute_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
r2_dt = r2_score(y_test, y_pred_dt)

mae_dt, rmse_dt, r2_dt


**Model Comparison**

The models are compared based on MAE, RMSE, and RÂ² metrics.

In [None]:
comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree'],
    'MAE': [mae_lr, mae_dt],
    'RMSE': [rmse_lr, rmse_dt],
    'R2': [r2_lr, r2_dt]
})

comparison


**Final Conclusion**

Linear Regression outperformed Decision Tree in terms of prediction accuracy and generalization ability.
Although more complex models may improve performance, Linear Regression was the most suitable model within the scope of this project.

**Prediction Example**

The best-performing model is used to predict property prices for unseen data.

In [None]:
sample_input = X_test.iloc[[0]]
predicted_price = lr_model.predict(sample_input)
predicted_price
