<a href="https://colab.research.google.com/github/Piripack/House-price-prediction/blob/main/Untitled22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Property Price Prediction Project

This project predicts average property prices for four property types (Detached, Semi-Detached, Terraced, and Flat) in the UK from historical data. The dataset contains monthly average prices for UK regions, with records starting in January 1995; the analysis uses data from 2005 onward. The goal of this project is to showcase my skills in data preprocessing, feature engineering, machine learning model development, and evaluation.

## Table of Contents

- [Project Overview](#property-price-prediction-project)
- [Data Cleaning & Preprocessing](#data-cleaning--preprocessing)
- [Feature Engineering](#feature-engineering)
- [Machine Learning Model](#machine-learning-model)
- [Model Evaluation](#model-evaluation)
- [Visualizations](#visualizations)
- [Conclusion](#conclusion)
- [Future Work](#future-work)

## Data Cleaning & Preprocessing

The dataset is loaded, cleaned, and preprocessed to remove missing values and irrelevant rows. The following steps are carried out:

1. **Convert Date to datetime format**: The 'Date' column is converted to the proper datetime format for easier manipulation.
2. **Remove Data Before 2005**: Data from years prior to 2005 is removed to focus on more recent trends.
3. **Rolling Averages**: A 12-month rolling average is calculated for each property type to smooth out fluctuations and identify long-term trends.

## Feature Engineering

The dataset is enhanced by creating new features such as:
- **Rolling averages** for each property type (Detached, Semi-Detached, Terraced, Flat) to highlight long-term trends.
- **Regional Aggregation**: Regional average prices for property types are computed to examine the regional variations in property prices.

## Machine Learning Model

A Random Forest Regressor model is trained using historical data from 2005-2007 and tested on 2008 data. The model predicts property prices for detached houses, but the same methodology can be extended to other property types. The performance is evaluated using common regression metrics such as RMSE, MAE, and R².
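The year-based split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual training cell: the data is synthetic, the features `Month` and `Detached_Lag_12` are hypothetical, and the hyperparameters are arbitrary defaults.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic monthly prices for a single region, standing in for the real dataset.
rng = np.random.default_rng(0)
dates = pd.date_range("2004-01-01", "2008-12-01", freq="MS")
prices = 200_000 + 1_000 * np.arange(len(dates)) + rng.normal(0, 2_000, len(dates))
toy = pd.DataFrame({"Date": dates, "Detached_Average_Price": prices})

# Hypothetical features: calendar month and the price 12 months earlier.
toy["Month"] = toy["Date"].dt.month
toy["Detached_Lag_12"] = toy["Detached_Average_Price"].shift(12)
toy = toy.dropna().reset_index(drop=True)

features = ["Month", "Detached_Lag_12"]
target = "Detached_Average_Price"

# Split by year so the test period lies strictly after the training period.
train = toy[toy["Date"].dt.year.between(2005, 2007)]
test = toy[toy["Date"].dt.year == 2008]

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(train[features], train[target])
predictions = model.predict(test[features])
```

Splitting by date rather than at random keeps future observations out of the training set, which matters for time-ordered data such as prices.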

## Model Evaluation

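The model is evaluated on the held-out test set (2008 data) using the following metrics:
- **RMSE (Root Mean Squared Error)**: The square root of the mean squared error; it penalizes large errors more heavily.
- **MAE (Mean Absolute Error)**: The average absolute difference between predicted and actual prices.
- **R² Score**: The proportion of variance in the actual prices that the model explains.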
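As a rough illustration of how these three metrics can be computed with scikit-learn, here is a small sketch. The values in `y_true` and `y_pred` are invented placeholders, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([250_000.0, 260_000.0, 255_000.0, 270_000.0])
y_pred = np.array([248_000.0, 263_000.0, 250_000.0, 274_000.0])

mae = mean_absolute_error(y_true, y_pred)
# RMSE is taken as the square root of the MSE, which works across scikit-learn versions.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"R²:   {r2:.3f}")
```

RMSE and MAE are expressed in the same units as the target (pounds), while R² is unitless, so the three are not directly comparable on one scale.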

## Visualizations

Several visualizations are included to showcase the trends and evaluation results, including:
- **Price Trend Over Time**: Showing the property price trends over time.
- **Feature Importance**: Visualizing the most important features used by the Random Forest model.
- **Actual vs Predicted Prices**: Scatter plot comparing actual vs predicted property prices.
- **Model Evaluation Metrics**: Bar chart comparing RMSE, MAE, and R² scores for the training and testing sets.
- **Residual Plot**: Visualizing the residuals (errors) of the model to check for any patterns.
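Two of the plots listed above, actual vs predicted and the residual plot, could be produced along these lines with matplotlib. This is a sketch under assumptions: the arrays are placeholder values rather than model output, and the figure layout is one possible choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values standing in for the test-set targets and model predictions.
y_true = np.array([250_000.0, 260_000.0, 255_000.0, 270_000.0])
y_pred = np.array([248_000.0, 263_000.0, 250_000.0, 274_000.0])
residuals = y_true - y_pred

fig, (ax_scatter, ax_resid) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs predicted: points on the dashed diagonal are perfect predictions.
ax_scatter.scatter(y_true, y_pred)
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax_scatter.plot(lims, lims, linestyle="--", color="grey")
ax_scatter.set_xlabel("Actual price")
ax_scatter.set_ylabel("Predicted price")
ax_scatter.set_title("Actual vs Predicted")

# Residual plot: residuals should scatter around zero without a visible pattern.
ax_resid.scatter(y_pred, residuals)
ax_resid.axhline(0, linestyle="--", color="grey")
ax_resid.set_xlabel("Predicted price")
ax_resid.set_ylabel("Residual (actual - predicted)")
ax_resid.set_title("Residuals")

fig.tight_layout()
```

In a notebook, calling `plt.show()` after `fig.tight_layout()` displays the figure.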

## Conclusion

This project demonstrates the application of data preprocessing, feature engineering, and machine learning techniques to predict property prices. The Random Forest model scored well on the 2008 test set across RMSE, MAE, and R², and provides a solid foundation for further exploration and improvements.

## Future Work

Future work includes:
- **Hyperparameter tuning** to optimize the Random Forest model.
- **Expanding the model** to predict prices for other property types.
- **Time series forecasting** methods to predict future property prices.
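As one possible starting point for the hyperparameter tuning item, the sketch below combines `GridSearchCV` with `TimeSeriesSplit`, so that each validation fold comes after its training fold in time. The synthetic data and the parameter grid are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, time-ordered data: one feature (month index) and a noisy upward trend.
rng = np.random.default_rng(0)
X = np.arange(48, dtype=float).reshape(-1, 1)
y = 200_000 + 1_000 * X.ravel() + rng.normal(0, 2_000, 48)

# A deliberately small, hypothetical grid to keep the search fast.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),         # folds respect temporal order
    scoring="neg_root_mean_squared_error",  # higher (closer to zero) is better
)
search.fit(X, y)
best_params = search.best_params_
print(best_params)
```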


# Project Code:


In [1]:
import pandas as pd

# Load the dataset
file_path = 'Average-prices-Property-Type-2023-12.csv'
df = pd.read_csv(file_path)

# Show the first few rows of the dataset to understand its structure
df.head()


Unnamed: 0,Date,Region_Name,Area_Code,Detached_Average_Price,Detached_Index,Detached_Monthly_Change,Detached_Annual_Change,Semi_Detached_Average_Price,Semi_Detached_Index,Semi_Detached_Monthly_Change,Semi_Detached_Annual_Change,Terraced_Average_Price,Terraced_Index,Terraced_Monthly_Change,Terraced_Annual_Change,Flat_Average_Price,Flat_Index,Flat_Monthly_Change,Flat_Annual_Change
0,1995-01-01,England,E92000001,86314.15895,28.257874,,,51533.22543,27.436474,,,41489.82431,25.279664,,,45218.54082,23.762969,,
1,1995-01-01,Wales,W92000004,66539.58684,32.491063,,,41043.45436,31.399881,,,32506.88477,30.777231,,,34061.27288,34.448112,,
2,1995-01-01,Inner London,E13000001,194483.5365,16.399257,,,121073.17,15.327414,,,87553.48096,14.627111,,,73707.69351,15.492239,,
3,1995-01-01,Outer London,E13000002,160329.9602,22.303302,,,94802.27143,21.065017,,,70087.65516,20.040752,,,58266.86811,21.764751,,
4,1995-01-01,London,E12000007,161449.3055,21.715622,,,95897.5293,20.321394,,,73705.96582,18.023197,,,64618.57236,17.858341,,


### Loading the Dataset

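The dataset is loaded into a pandas DataFrame. Each row describes one region on one date, identified by 'Date', 'Region_Name', and 'Area_Code'. For each property type (Detached, Semi-Detached, Terraced, Flat) there are four columns: the average price (for example 'Detached_Average_Price'), an index, and the monthly and annual change. The first few rows are displayed to understand the structure; the change columns are empty for the earliest dates.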


In [3]:
# Convert the 'Date' column to datetime format to make filtering easier
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

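# Remove data before 2005; .copy() yields an independent DataFrame so that
# columns added later do not trigger pandas' SettingWithCopyWarning
df_cleaned = df[df['Date'].dt.year >= 2005].copy()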

# Checking if the cleaning was successful
df_cleaned.head(10)


Unnamed: 0,Date,Region_Name,Area_Code,Detached_Average_Price,Detached_Index,Detached_Monthly_Change,Detached_Annual_Change,Semi_Detached_Average_Price,Semi_Detached_Index,Semi_Detached_Monthly_Change,Semi_Detached_Annual_Change,Terraced_Average_Price,Terraced_Index,Terraced_Monthly_Change,Terraced_Annual_Change,Flat_Average_Price,Flat_Index,Flat_Monthly_Change,Flat_Annual_Change
43368,2005-01-01,Wales,W92000004,186546.1861,91.089893,1.923221,22.400125,117433.2225,89.841103,1.214256,23.849785,94407.93991,89.384601,1.369423,28.517332,101512.1444,102.665033,1.510667,23.687107
43369,2005-01-01,England,E92000001,246075.8815,80.561305,0.130084,9.956122,151184.3086,80.491068,-0.937909,13.468535,125600.9477,76.528396,-1.055054,17.310283,143394.5548,75.355825,-1.269064,10.862571
43370,2005-01-01,Northern Ireland,N92000001,160428.8327,95.46556,,,104030.5874,95.251126,,,82022.27076,108.704488,,,100737.8658,115.277614,,
43371,2005-01-01,Scotland,S92000003,164859.6848,70.937575,0.393706,14.332761,96584.09388,68.797658,-2.817655,17.327684,74711.90178,66.311541,-4.294214,20.105968,72276.21552,74.400535,-4.34711,19.059221
43372,2005-01-01,Inner London,E13000001,548844.7127,46.279729,-0.051831,3.702792,375534.907,47.541327,0.003076,4.885212,284790.261,47.578449,0.326981,7.553006,239983.3959,50.440869,0.264361,6.066293
43373,2005-01-01,Outer London,E13000002,441604.4202,61.431044,-0.021682,4.600593,279666.3044,62.141713,-0.27308,6.103891,213326.3591,60.998197,-0.043285,8.772424,183495.978,68.542285,-0.142638,7.682202
43374,2005-01-01,Falkirk,S12000014,135747.1509,71.856164,-5.050601,3.010213,76611.63183,68.692988,-5.654207,8.042342,55885.30979,66.653298,-5.552982,11.535647,46862.0455,74.090638,-5.630112,11.670259
43375,2005-01-01,Clackmannanshire,S12000005,138689.0639,73.84548,-1.777064,3.440039,77433.98143,72.219463,-2.696416,8.602571,56314.97226,69.092739,-2.591708,12.249645,46981.05262,76.724311,-2.626752,11.910329
43376,2005-01-01,Stirling,S12000030,222580.6187,82.108333,6.327137,2.193341,117009.2043,80.454298,5.921551,7.310029,90267.00444,76.769001,6.247826,10.678267,91755.968,85.740782,5.46195,9.990052
43377,2005-01-01,Argyll and Bute,S12000035,163388.0024,79.162795,5.344505,26.899792,101445.8192,76.719753,5.36782,33.169485,78398.5382,75.644696,5.91601,36.818946,74255.83085,83.599514,5.558676,37.06628


### Data Cleaning

The 'Date' column is converted to the proper datetime format for easier filtering. Rows with dates before 2005 are removed to focus on more recent property price trends. The cleaned dataset is displayed to confirm that the preprocessing steps were successful.
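To make the effect of `errors='coerce'` concrete, here is a small sketch on a made-up series. Unparseable values become `NaT`, and the year filter then drops them silently, so it can be worth counting them first.

```python
import pandas as pd

# Made-up input: two valid dates and one value that cannot be parsed.
raw = pd.Series(["2004-12-01", "2005-01-01", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")
n_unparsed = int(parsed.isna().sum())

# NaT has no year, so the comparison is False and the row is filtered out.
kept = parsed[parsed.dt.year >= 2005]

print(n_unparsed)
print(kept)
```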


In [8]:
# Calculate 12-month rolling averages within each region. Each date has one row
# per region, so a plain .rolling(12) over the whole frame would average across
# neighbouring regions instead of across 12 consecutive months.
df_cleaned = df_cleaned.sort_values('Date').copy()
for prop in ['Detached', 'Semi_Detached', 'Terraced', 'Flat']:
    df_cleaned[f'{prop}_Rolling_Avg'] = (
        df_cleaned.groupby('Region_Name')[f'{prop}_Average_Price']
        .transform(lambda s: s.rolling(window=12).mean())
    )


### Rolling Averages

12-month rolling averages are calculated for each property type (Detached, Semi-Detached, Terraced, and Flat) to smooth out fluctuations in the data and highlight long-term trends. These new features will be used for further analysis and model training.
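One caveat when computing rolling averages on this dataset: each date has one row per region, so a window taken over the whole frame spans neighbouring regions rather than consecutive months. The sketch below uses made-up toy values and a window of 2 instead of 12 for brevity. It contrasts a frame-wide rolling mean with one computed within each region.

```python
import pandas as pd

# Toy data: two regions interleaved by date, mimicking the dataset's layout.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-01", "2005-01-01", "2005-02-01",
                            "2005-02-01", "2005-03-01", "2005-03-01"]),
    "Region_Name": ["England", "Wales"] * 3,
    "Detached_Average_Price": [200.0, 100.0, 210.0, 110.0, 220.0, 120.0],
})

# Rolling over the whole frame mixes England and Wales rows in each window.
toy["Mixed"] = toy["Detached_Average_Price"].rolling(window=2).mean()

# Rolling within each region averages consecutive months of the same region.
toy["Per_Region"] = (
    toy.sort_values("Date")
       .groupby("Region_Name")["Detached_Average_Price"]
       .transform(lambda s: s.rolling(window=2).mean())
)

print(toy)
```

In the `Mixed` column the second row averages an England price with a Wales price, whereas `Per_Region` only produces a value once a region has two months of its own data.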


In [10]:
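# Derive the 'Year' column used by the groupby below; it was never created,
# which is what raised KeyError: 'Year'
df_cleaned = df_cleaned.assign(Year=df_cleaned['Date'].dt.year)

# Create region-wise features to track changes in specific regions (e.g., England vs Scotland)
region_avg_prices = df_cleaned.groupby(['Region_Name', 'Year'])[['Detached_Average_Price',
                                                             'Semi_Detached_Average_Price',
                                                             'Terraced_Average_Price',
                                                             'Flat_Average_Price']].mean().reset_index()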