#INTRODUCTION
#REAL ESTATE SALES PRICE PREDICTION


PROBLEM STATEMENT

The real estate market plays a crucial role in the economy, with property prices being influenced by numerous factors such as location, property size, amenities, and market trends. Accurately predicting real estate prices is essential for buyers, sellers, investors, and policymakers. This project aims to develop a predictive model for real estate sales prices using a comprehensive dataset of past transactions, property features, and market conditions. Various regression models will be implemented and evaluated to determine the best-performing model for price prediction.

PROBLEM DESCRIPTION

The dataset consists of historical real estate sales data, including property characteristics, transaction details, and geographical information. Predicting real estate prices is challenging due to the complex relationships among various factors that influence pricing. The project will explore these relationships and apply different regression techniques to build an accurate prediction model.

PROJECT OBJECTIVES

To develop a predictive model for real estate sales prices.

To identify the key factors that significantly impact property prices.

To evaluate multiple regression models and select the most accurate one.

To provide insights and recommendations for buyers, sellers, and real estate professionals.

DATASET CHARACTERISTICS

Type: Regression

Features: List year, recording date, town, address, assessed value, sales ratio, property type, residential type, and location, plus sparsely populated remark columns (non-use code, assessor remarks, OPM remarks).

Target Variable: Sale Amount (the sale price recorded for each transaction)

Source: Real_Estate_Sales_2001-2022_GL.csv, a public record of real estate sales transactions

This project will utilize machine learning techniques to analyze and predict real estate prices, ultimately aiding decision-making in the housing market.

#DATA CLEANING AND PREPROCESSING

In [2]:
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
data = pd.read_csv(r"C:\Users\czone\Downloads\Real_Estate_Sales_2001-2022_GL.csv")


In [11]:
print("Shape of the dataset:", data.shape)
data.head()

Shape of the dataset: (1097629, 14)


Unnamed: 0,Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location
0,220008,2022,01/30/2023,Andover,618 ROUTE 6,139020.0,232000.0,0.5992,Residential,Single Family,,,,POINT (-72.343628962 41.728431984)
1,2020348,2020,09/13/2021,Ansonia,230 WAKELEE AVE,150500.0,325000.0,0.463,Commercial,,,,,
2,20002,2020,10/02/2020,Ashford,390 TURNPIKE RD,253000.0,430000.0,0.5883,Residential,Single Family,,,,
3,210317,2021,07/05/2022,Avon,53 COTSWOLD WAY,329730.0,805000.0,0.4096,Residential,Single Family,,,,POINT (-72.846365959 41.781677018)
4,200212,2020,03/09/2021,Avon,5 CHESTNUT DRIVE,130400.0,179900.0,0.7248,Residential,Condo,,,,
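The preview rows suggest that `Sales Ratio` is simply `Assessed Value` divided by `Sale Amount`. If that holds for the full dataset, the column is derived from the target and should probably be left out of the model inputs. A minimal check on the five rows shown above, with values copied from the output and a tolerance that is an assumption:

```python
# Rows copied from the data.head() output: (assessed value, sale amount, sales ratio).
rows = [
    (139020.0, 232000.0, 0.5992),
    (150500.0, 325000.0, 0.463),
    (253000.0, 430000.0, 0.5883),
    (329730.0, 805000.0, 0.4096),
    (130400.0, 179900.0, 0.7248),
]

TOLERANCE = 1e-3  # assumed allowance for rounding in the published ratio


def ratio_matches(assessed, sale, ratio, tol=TOLERANCE):
    """Return True when ratio is approximately assessed / sale."""
    return abs(assessed / sale - ratio) <= tol


all_match = all(ratio_matches(a, s, r) for a, s, r in rows)
print("Sales Ratio matches Assessed Value / Sale Amount on all preview rows:", all_match)
```

Five rows are not proof; the same comparison would need to be run over the whole column before drawing a conclusion.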


In [13]:
print("Feature names and their datatypes:")
data.info()

Feature names and their datatypes:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1097629 entries, 0 to 1097628
Data columns (total 14 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   Serial Number     1097629 non-null  int64  
 1   List Year         1097629 non-null  int64  
 2   Date Recorded     1097627 non-null  object 
 3   Town              1097629 non-null  object 
 4   Address           1097578 non-null  object 
 5   Assessed Value    1097629 non-null  float64
 6   Sale Amount       1097629 non-null  float64
 7   Sales Ratio       1097629 non-null  float64
 8   Property Type     715183 non-null   object 
 9   Residential Type  699240 non-null   object 
 10  Non Use Code      313451 non-null   object 
 11  Assessor Remarks  171228 non-null   object 
 12  OPM remarks       13031 non-null    object 
 13  Location          298111 non-null   object 
dtypes: float64(3), int64(2), object(9)
memory usage: 117.2+ MB
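The `Location` column is stored as text such as `POINT (-72.343628962 41.728431984)`, which a regression model cannot use directly. The sketch below splits such a string into two numbers. It assumes the usual well-known-text order of longitude first and latitude second, and `parse_point` is a hypothetical helper name:

```python
import re

# Matches WKT points such as "POINT (-72.343628962 41.728431984)".
_POINT_RE = re.compile(r"^POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$")


def parse_point(value):
    """Return (longitude, latitude) as floats, or None if value is not a WKT point."""
    if not isinstance(value, str):
        return None  # covers NaN / None for rows without a location
    match = _POINT_RE.match(value.strip())
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_point("POINT (-72.343628962 41.728431984)"))
print(parse_point(float("nan")))
```

On the DataFrame this could be applied with `data['Location'].map(parse_point)`, keeping in mind that only 298,111 rows have a location at all.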


In [15]:
print("Missing values per column:")
print(data.isnull().sum())

Missing values per column:
Serial Number             0
List Year                 0
Date Recorded             2
Town                      0
Address                  51
Assessed Value            0
Sale Amount               0
Sales Ratio               0
Property Type        382446
Residential Type     398389
Non Use Code         784178
Assessor Remarks     926401
OPM remarks         1084598
Location             799518
dtype: int64
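Raw counts are hard to compare at a glance, so it can help to express them as a share of all rows. The sketch below does this with the counts copied from the output above and the row total from `data.shape`:

```python
TOTAL_ROWS = 1_097_629  # from data.shape

# Missing-value counts copied from the output above (zero-count columns omitted).
missing_counts = {
    "Date Recorded": 2,
    "Address": 51,
    "Property Type": 382_446,
    "Residential Type": 398_389,
    "Non Use Code": 784_178,
    "Assessor Remarks": 926_401,
    "OPM remarks": 1_084_598,
    "Location": 799_518,
}

# Share of rows with a missing value, in percent.
missing_pct = {col: 100 * n / TOTAL_ROWS for col, n in missing_counts.items()}

# Print the columns from most to least sparse.
for col, pct in sorted(missing_pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<18}{pct:6.2f}%")
```

Directly on the DataFrame, `data.isnull().mean() * 100` gives the same percentages.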


In [17]:
# dropna() with no arguments drops every row that has a missing value in any column.
data_cleaned = data.dropna()

In [19]:
data_cleaned = data_cleaned.drop_duplicates()
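Because several columns are mostly empty, this cleaning leaves only 560 of the 1,097,629 rows (the count reported in the summary statistics). One possible alternative, sketched on a toy frame, is to drop the sparse columns first and then remove only rows that lack a required field. The 50% threshold, the choice of required columns, and the name `clean_sales` are all assumptions, not part of the original notebook:

```python
import pandas as pd


def clean_sales(df, max_missing_share=0.5, required=("Date Recorded", "Sale Amount")):
    """Drop sparse columns first, then only rows missing a required field."""
    sparse = df.columns[df.isnull().mean() > max_missing_share]
    kept = df.drop(columns=sparse)
    required_present = [c for c in required if c in kept.columns]
    return kept.dropna(subset=required_present).drop_duplicates()


# Toy data that mimics the sparsity pattern of the real columns.
toy = pd.DataFrame({
    "Date Recorded": ["01/30/2023", "09/13/2021", None, "07/05/2022"],
    "Sale Amount": [232000.0, 325000.0, 430000.0, 805000.0],
    "Property Type": ["Residential", "Commercial", "Residential", None],
    "OPM remarks": [None, None, None, "note"],
})

print(len(toy.dropna()), "rows survive a blanket dropna()")
print(len(clean_sales(toy)), "rows survive the column-aware version")
```

Rows kept this way can still contain missing values in optional columns such as `Property Type`, so those would need imputing or encoding before modelling.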

In [27]:
print(data_cleaned.columns)


Index(['Serial Number', 'List Year', 'Date Recorded', 'Town', 'Address',
       'Assessed Value', 'Sale Amount', 'Sales Ratio', 'Property Type',
       'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks',
       'Location'],
      dtype='object')


In [29]:
data_cleaned['Date Recorded'] = pd.to_datetime(data_cleaned['Date Recorded'], errors='coerce')
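With `errors='coerce'`, any value that cannot be parsed becomes `NaT` without raising, so it is worth counting those afterwards. The sketch below also passes an explicit format matching the `MM/DD/YYYY` values seen in the preview, and derives two calendar features. The feature names `sale_year` and `sale_month` are hypothetical:

```python
import pandas as pd

# Sample values in the MM/DD/YYYY form seen in the preview, plus one bad entry.
raw = pd.Series(["01/30/2023", "09/13/2021", "not a date"])

# An explicit format avoids ambiguous day/month guessing; bad values become NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Hypothetical feature names; the source does not define them.
features = pd.DataFrame({
    "sale_year": parsed.dt.year,
    "sale_month": parsed.dt.month,
})

print("Unparseable dates:", int(parsed.isna().sum()))
print(features)
```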

In [31]:
categorical_columns = ['Property Type', 'Town', 'Residential Type']  # names must match data_cleaned.columns exactly
for col in categorical_columns:
    if col in data_cleaned.columns:
        data_cleaned[col] = data_cleaned[col].astype('category')
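A loop guarded by `if col in data_cleaned.columns` skips any name that does not match exactly, without reporting it. A small check like the one below can surface such names before the conversion runs. `missing_columns` is a hypothetical helper, and `"PropertyType"` (no space) is used as an illustrative misspelling:

```python
def missing_columns(requested, available):
    """Return the requested names that are absent from available, in order."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


# Column names as printed by data_cleaned.columns.
available = [
    "Serial Number", "List Year", "Date Recorded", "Town", "Address",
    "Assessed Value", "Sale Amount", "Sales Ratio", "Property Type",
    "Residential Type", "Non Use Code", "Assessor Remarks", "OPM remarks",
    "Location",
]

print(missing_columns(["PropertyType", "Town"], available))   # one name is flagged
print(missing_columns(["Property Type", "Town"], available))  # nothing is flagged
```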

In [33]:
print("Summary Statistics:")
print(data_cleaned.describe())

Summary Statistics:
       Serial Number    List Year                  Date Recorded  \
count   5.600000e+02   560.000000                            560   
mean    1.891504e+06  2020.185714  2021-06-18 01:09:25.714285568   
min     2.214000e+03  2016.000000            2017-01-25 00:00:00   
25%     1.804415e+05  2019.000000            2020-03-31 12:00:00   
50%     2.101240e+05  2021.000000            2021-11-20 12:00:00   
75%     2.202935e+05  2022.000000            2022-12-14 00:00:00   
max     2.020002e+08  2022.000000            2023-09-29 00:00:00   
std     1.691041e+07     1.787797                            NaN   

       Assessed Value   Sale Amount  Sales Ratio  
count    5.600000e+02  5.600000e+02   560.000000  
mean     1.701261e+05  9.091623e+05     0.838064  
min      0.000000e+00  4.500000e+03     0.000000  
25%      9.025750e+04  1.100000e+05     0.413374  
50%      1.416000e+05  2.140000e+05     0.625957  
75%      1.992075e+05  3.752500e+05     1.020047  
max      2

In [35]:
data_cleaned.to_csv("real_estate_sales_cleaned.csv", index=False)

print("Data cleaning completed. Cleaned dataset saved as real_estate_sales_cleaned.csv")

Data cleaning completed. Cleaned dataset saved as real_estate_sales_cleaned.csv
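CSV stores plain text only, so the datetime and category dtypes set during cleaning are lost when the file is read back. The sketch below shows one way to restore them on load. It uses an in-memory buffer in place of the real file, and the toy values are assumptions:

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Date Recorded": pd.to_datetime(["2023-01-30", "2021-09-13"]),
    "Town": pd.Series(["Andover", "Ansonia"], dtype="category"),
    "Sale Amount": [232000.0, 325000.0],
})

buffer = io.StringIO()  # in-memory stand-in for the CSV file
df.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dtypes are inferred from text

buffer.seek(0)
restored = pd.read_csv(
    buffer,
    parse_dates=["Date Recorded"],  # re-parse the date column
    dtype={"Town": "category"},     # re-apply the categorical dtype
)

print(plain.dtypes)
print(restored.dtypes)
```

The same `parse_dates` and `dtype` arguments would apply when reading `real_estate_sales_cleaned.csv`.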


In [37]:
print("Duplicates in the dataset are:")
data.duplicated().sum()

Duplicates in the dataset are:


np.int64(0)