# Housing Price Prediction
This [dataset](https://www.kaggle.com/datasets/muhammadbinimran/housing-price-prediction-data/data) contains 50,000 rows and 6 columns. It will be expanded in the future to support the model.

The dataset has some limitations: it is synthetic and does not contain any real-world information.
 
Addressing these limitations, for example by adding real-world data, would help to improve the model for real-world use.

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [21]:
df = pd.read_csv('../Data/housing_price.csv')

In [22]:
# Check the first 5 rows
df.head()

Unnamed: 0,SquareFeet,Bedrooms,Bathrooms,Neighborhood,YearBuilt,Price
0,2126,4,1,Rural,1969,215355.283618
1,2459,3,2,Rural,1980,195014.221626
2,1860,2,1,Suburb,1970,306891.012076
3,2294,2,1,Urban,1996,206786.787153
4,2130,5,2,Suburb,2001,272436.239065


- SquareFeet: The size of the house in square feet.
- Bedrooms: How many bedrooms the house has.
- Bathrooms: Number of bathrooms.
- Neighborhood: Whether the house is in a rural, suburban, or urban area.
- YearBuilt: The year the house was built.
- Price: The price of the house.
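
The column descriptions above imply a few basic expectations (positive sizes and counts, one of three neighborhood labels, non-negative prices). A minimal sketch of how those expectations could be checked is shown here. It uses the five rows printed by `df.head()` as a stand-in sample; the helper name `check_columns` and the specific rules are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Stand-in sample: the five rows printed by df.head() above.
sample = pd.DataFrame({
    "SquareFeet": [2126, 2459, 1860, 2294, 2130],
    "Bedrooms": [4, 3, 2, 2, 5],
    "Bathrooms": [1, 2, 1, 1, 2],
    "Neighborhood": ["Rural", "Rural", "Suburb", "Urban", "Suburb"],
    "YearBuilt": [1969, 1980, 1970, 1996, 2001],
    "Price": [215355.283618, 195014.221626, 306891.012076,
              206786.787153, 272436.239065],
})


def check_columns(frame):
    """Count rows that violate each rule (hypothetical rules, for illustration)."""
    allowed = ["Rural", "Suburb", "Urban"]
    return {
        "non_positive_square_feet": int((frame["SquareFeet"] <= 0).sum()),
        "non_positive_bedrooms": int((frame["Bedrooms"] <= 0).sum()),
        "non_positive_bathrooms": int((frame["Bathrooms"] <= 0).sum()),
        "unknown_neighborhood": int((~frame["Neighborhood"].isin(allowed)).sum()),
        "negative_price": int((frame["Price"] < 0).sum()),
    }


violations = check_columns(sample)
print(violations)
```

Running the same function on the full `df` would report how many rows break each rule, which is a quick way to spot problems before looking at summary statistics.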

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SquareFeet    50000 non-null  int64  
 1   Bedrooms      50000 non-null  int64  
 2   Bathrooms     50000 non-null  int64  
 3   Neighborhood  50000 non-null  object 
 4   YearBuilt     50000 non-null  int64  
 5   Price         50000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 2.3+ MB


In [24]:
# checking for null values
df.isnull().sum()

SquareFeet      0
Bedrooms        0
Bathrooms       0
Neighborhood    0
YearBuilt       0
Price           0
dtype: int64

Looking over this output, we can see that there are no null values. We should also check for outliers.
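
One common way to check for outliers is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR below the first quartile or above the third. The sketch below is one possible approach, not the method used in this notebook. It runs on a small stand-in sample (five prices from `df.head()` plus the lowest price in the dataset, -36588.165397), and the helper name `iqr_outliers` is hypothetical.

```python
import pandas as pd

# Stand-in sample: five prices from df.head() plus the dataset's lowest price.
prices = pd.Series(
    [215355.283618, 195014.221626, 306891.012076,
     206786.787153, 272436.239065, -36588.165397],
    name="Price",
)


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


outliers = iqr_outliers(prices)
print(outliers)
```

On the full dataset the same helper could be called as `iqr_outliers(df['Price'])`. The multiplier `k=1.5` is a convention rather than a rule, so it may need tuning.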

In [25]:
df['Price'].describe()

count     50000.000000
mean     224827.325151
std       76141.842966
min      -36588.165397
25%      169955.860225
50%      225052.141166
75%      279373.630052
max      492195.259972
Name: Price, dtype: float64

Descriptive Statistics:

- The average (mean) price of a house is about 224,827.
- The standard deviation is around 76,142. This tells us how spread out the prices are.
- The minimum price is -36,588, which is suspicious: a house cannot cost less than zero.
- The maximum price is about 492,195.

In [26]:
# Let's check the lowest prices and see if we can make sense of them
df.sort_values(by='Price').head(22)

Unnamed: 0,SquareFeet,Bedrooms,Bathrooms,Neighborhood,YearBuilt,Price
33666,1013,5,2,Urban,1960,-36588.165397
17706,1080,5,1,Rural,1955,-28774.998022
1266,1024,2,2,Urban,2006,-24715.242482
8720,1235,3,1,Urban,1952,-24183.000515
5118,1140,4,1,Urban,2020,-23911.003119
3630,1235,3,2,Rural,2012,-19871.251146
20211,1049,3,1,Rural,2005,-18159.685676
6355,1016,5,2,Rural,1997,-13803.684059
9611,1131,3,3,Urban,1959,-13692.026068
4162,1352,5,2,Suburb,1977,-10608.359522


In [29]:
negative_price = (df['Price'] < 0).sum()
percentage_negative_price = (negative_price / len(df)) * 100
percentage_negative_price

0.044000000000000004

Since these negative prices make up only about 0.044% of the dataset (22 of 50,000 rows), I think it is worth removing them.
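
The percentage calculation can be wrapped in a small helper that returns both the count and the share of negative prices. This is a sketch under stated assumptions: the function name `negative_share` is hypothetical, and the sample mixes three of the negative prices listed above with the five prices from `df.head()`, so its percentage is not representative of the full dataset. The last lines redo the full-dataset arithmetic, where 0.044% of 50,000 rows corresponds to 22 rows, and round away the floating-point noise seen in `0.044000000000000004`.

```python
import pandas as pd

# Stand-in sample: three negative prices from the sorted output above
# plus the five prices from df.head().
sample_prices = pd.Series([
    -36588.165397, -28774.998022, -24715.242482,
    215355.283618, 195014.221626, 306891.012076,
    206786.787153, 272436.239065,
])


def negative_share(prices):
    """Return (count, percentage) of strictly negative prices."""
    mask = prices < 0
    # The mean of a boolean mask is the fraction of True values.
    return int(mask.sum()), float(mask.mean() * 100)


count, pct = negative_share(sample_prices)
print(count, round(pct, 3))

# Full dataset: 22 negative rows out of 50,000, rounded for display.
full_pct = 22 / 50_000 * 100
print(round(full_pct, 3))
```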

In [30]:
df = df[df['Price'] >= 0]

In [33]:
df.sort_values(by='Price').head()

Unnamed: 0,SquareFeet,Bedrooms,Bathrooms,Neighborhood,YearBuilt,Price
40144,1006,2,1,Suburb,1973,154.77912
17216,1013,2,1,Suburb,2018,276.063516
36235,1112,3,1,Suburb,1978,2360.27445
29980,1005,3,3,Urban,1978,2697.849758
23662,1256,3,1,Rural,1978,3000.859614


We have now removed all the negative prices. We are going to keep the remaining low prices, as they might come from families selling to other family members, or have some other explanation.
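
If these low prices are kept, it can still be useful to mark them so their effect on the model can be examined later. The sketch below is one possible way to do that, and is not part of the original analysis. It flags rows by price per square foot instead of dropping them. The threshold `MIN_PRICE_PER_SQFT`, the helper `flag_low_prices`, and the new column names are all hypothetical. The sample combines the five lowest-priced rows above with two rows from `df.head()`.

```python
import pandas as pd

# Stand-in sample: the five lowest-priced rows plus two rows from df.head().
sample = pd.DataFrame({
    "SquareFeet": [1006, 1013, 1112, 1005, 1256, 2126, 2459],
    "Price": [154.77912, 276.063516, 2360.27445, 2697.849758,
              3000.859614, 215355.283618, 195014.221626],
})

# Hypothetical cutoff; choose a value that suits the data and the market.
MIN_PRICE_PER_SQFT = 10.0


def flag_low_prices(frame, threshold=MIN_PRICE_PER_SQFT):
    """Return a copy with PricePerSqFt and a boolean LowPriceFlag column."""
    flagged = frame.copy()
    flagged["PricePerSqFt"] = flagged["Price"] / flagged["SquareFeet"]
    flagged["LowPriceFlag"] = flagged["PricePerSqFt"] < threshold
    return flagged


flagged = flag_low_prices(sample)
print(flagged[["SquareFeet", "Price", "PricePerSqFt", "LowPriceFlag"]])
```

Flagging keeps every row in the dataset, so the decision to retain the low prices stays intact while still allowing a comparison of model results with and without them.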