# The Impact of Location and Property Characteristics on House Sale Prices: An Inferential Analysis

## 1. Business Understanding

### (a) Introduction

Real estate is property consisting of land and the buildings on it, along with its natural resources. The history of real estate can be traced back to ancient times, when land was acquired by conquest, purchase or inheritance. In the United States, real estate brokers began presenting houses for sale around 1900. By 1908, the National Association of Real Estate Exchanges had been founded to bring brokers and agents together to facilitate the selling of homes.

Real estate agencies are business organizations that generally represent either the buyer or the seller in home transactions. They work as a collective group of licensed agents and/or brokers who operate in a given geographical area. Real estate agents are hired to market and sell properties on behalf of home sellers. They vet potential buyers, lead viewings, and help negotiate the final selling price. They usually earn a base annual salary and may earn commission on house sales.

Agents who work for the seller, also known as listing agents, advise clients on how to price the property and prepare for a sale, including tips on last-minute improvements that can help boost the price or encourage speedy offers. Seller agents market the property through listing services, networking and advertisements. Agents who work for buyers, on the other hand, search for available properties that match the buyer's price range and wish list. These agents often look at past sales data on comparable properties to help prospective buyers come up with a fair bid.

Generally, real estate agencies act as intermediaries between property buyers, sellers, landlords and tenants. They represent their clients' interests and work to achieve their goals in real estate transactions. This representation may involve marketing properties, identifying potential buyers or tenants, negotiating deals and handling paperwork.

This project uses linear regression to analyze the relationship between a house's location, its characteristics and its sale price. The model takes into account intrinsic characteristics of a property, such as the number of bedrooms, the number of bathrooms, the square footage of living space, the level of craftsmanship used to build the house and the square footage of the lot, as well as extrinsic factors such as the location of the property. By analyzing these factors, the model will provide guidance to the Azizi Realtors real estate agency when it advises its clientele on property valuations. This offers a more quantitative approach to real estate valuation than traditional approaches, which can lean towards the qualitative side. The analysis will give the clientele of Azizi Realtors insights that help them make informed decisions, which in turn benefits the agency by improving the service it provides to its clients.




### (b) Problem Statement

Azizi Realtors want to provide effective advice to their clientele on how location and house characteristics may increase the estimated value of a house. To do this effectively, the agency needs a deep understanding of the factors that influence property values. Our goal is to develop a linear regression model that uses data on past sales to capture the relationship between a house's location, its characteristics and its sale price. The model allows us to estimate how the sale price changes as the independent variables change. With this information, Azizi Realtors can effectively advise their clients on buying, selling and investing in properties. We thereby aim to increase the business value of the agency by enabling it to provide accurate and informed advice to its clientele, leading to increased customer flow, satisfaction and loyalty.

### (c) Defining a metric for success

### (d) Main Objective

To develop a multiple linear regression model that establishes the relationship between a house's location and characteristics and its sale price.

### (e) Specific Objectives

- Testing the assumptions of multiple linear regression
- Analyzing data to identify the most important factors that affect house prices
- Building a linear regression model that evaluates a house's price and how it is impacted by location and house characteristics.
- Evaluating the model's statistical significance, MSE and coefficients to come up with interpretable results
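
For reference, the quantities named in these objectives can be written out as follows. This is the standard textbook formulation, not something specific to this notebook; the symbols ($\beta_j$ for coefficients, $x_{ij}$ for the value of predictor $j$ for house $i$, $\hat{y}_i$ for the fitted price) are notation introduced here for illustration.

$$
\text{price}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i
$$

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

Note that some libraries report a residual mean square that divides by the residual degrees of freedom, $n - p - 1$, rather than by $n$, so the reported value may differ slightly from the formula above.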

### (f) Recording the Experimental Design

- **Reading and checking data**
This stage involves examining the data and making sense of the column names and their meanings.

- **Data Wrangling**
This stage involves handling missing and placeholder values, removing outliers and encoding categorical data so that it can be used in the model.

- **Modelling**
This stage involves fitting a linear regression model with the sale price as the dependent variable and examining how it changes as the independent variables change.

- **Regression Results**
This stage involves interpreting the model's coefficients, the R-squared and the MSE to come up with meaningful insights.

- **Conclusions and Recommendations**
This stage involves using the results of the analysis to come up with insights and recommendations for Azizi Realtors, in order for them to provide effective advice to their clientele.
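
The Modelling and Regression Results stages above can be sketched roughly as follows. This is a minimal illustration on synthetic, made-up data (not the King County dataset), and it uses NumPy's least-squares solver as a stand-in so that the example stays self-contained. The variable names and the "true" coefficients are assumptions chosen for the example.

```python
import numpy as np

# Synthetic data for illustration only (NOT from the King County dataset).
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 4000, size=200)
bedrooms = rng.integers(1, 6, size=200)
noise = rng.normal(0, 20_000, size=200)
price = 50_000 + 200 * sqft_living + 10_000 * bedrooms + noise

# Design matrix with an intercept column, fitted by ordinary least squares.
X = np.column_stack([np.ones_like(sqft_living), sqft_living, bedrooms])
coefs, *_ = np.linalg.lstsq(X, price, rcond=None)

# Regression results: residuals, MSE and R-squared.
residuals = price - X @ coefs
mse = np.mean(residuals ** 2)
r_squared = 1 - np.sum(residuals ** 2) / np.sum((price - price.mean()) ** 2)

print("intercept, sqft_living, bedrooms:", coefs.round(2))
print("MSE:", round(mse, 2), "R-squared:", round(r_squared, 4))
```

Each coefficient is read as the expected change in price for a one-unit change in that predictor, holding the others fixed. A library that also reports p-values would additionally be needed for the statistical-significance part of the objectives.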

### (g) Data Relevance

This analysis uses the King County housing dataset, which has 21,597 rows and 21 columns. The dataset includes information such as the number of bedrooms and bathrooms, the square footage of the house and its location. Using this data, we can develop a multiple linear regression model that relates a house's location and characteristics to its price, and gain insights into how these factors affect house prices in King County, Washington. This can help Azizi Realtors make predictions about house prices and therefore effectively advise their clients.

## 2. Reading and Checking Data

In [2]:
# importing the necessary libraries

import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# reading into the data

housing_df = pd.read_csv("data/kc_house_data.csv")
housing_df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,6 Low Average,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503


In [4]:
# checking the last rows of the dataset

housing_df.tail()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
21592,263000018,5/21/2014,360000.0,3,2.5,1530,1131,3.0,NO,NONE,...,8 Good,1530,0.0,2009,0.0,98103,47.6993,-122.346,1530,1509
21593,6600060120,2/23/2015,400000.0,4,2.5,2310,5813,2.0,NO,NONE,...,8 Good,2310,0.0,2014,0.0,98146,47.5107,-122.362,1830,7200
21594,1523300141,6/23/2014,402101.0,2,0.75,1020,1350,2.0,NO,NONE,...,7 Average,1020,0.0,2009,0.0,98144,47.5944,-122.299,1020,2007
21595,291310100,1/16/2015,400000.0,3,2.5,1600,2388,2.0,,NONE,...,8 Good,1600,0.0,2004,0.0,98027,47.5345,-122.069,1410,1287
21596,1523300157,10/15/2014,325000.0,2,0.75,1020,1076,2.0,NO,NONE,...,7 Average,1020,0.0,2008,0.0,98144,47.5941,-122.299,1020,1357


In [5]:
#checking the info of the data

housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

## 3. Data Wrangling

In [6]:
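# keeping only the columns used in the model; .copy() avoids chained-assignment warnings later
housing_df = housing_df[["price","bedrooms","bathrooms","sqft_living","sqft_lot","grade","waterfront","view","floors"]].copy()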
housing_df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,grade,waterfront,view,floors
0,221900.0,3,1.0,1180,5650,7 Average,,NONE,1.0
1,538000.0,3,2.25,2570,7242,7 Average,NO,NONE,2.0
2,180000.0,2,1.0,770,10000,6 Low Average,NO,NONE,1.0
3,604000.0,4,3.0,1960,5000,7 Average,NO,NONE,1.0
4,510000.0,3,2.0,1680,8080,8 Good,NO,NONE,1.0


The dataset has been restricted to the columns that will be used in the model, which makes it easier to focus the analysis on them.

#### (i) Identifying and handling missing data

In [7]:
# checking for missing values in the dataset

housing_df.isna().sum()

price             0
bedrooms          0
bathrooms         0
sqft_living       0
sqft_lot          0
grade             0
waterfront     2376
view             63
floors            0
dtype: int64

In [8]:
# checking the unique values in waterfront column

housing_df["waterfront"].unique()

array([nan, 'NO', 'YES'], dtype=object)

In [9]:
# filling the missing values in the waterfront column

housing_df["waterfront"] = housing_df["waterfront"].fillna("NO")

All the missing values in the waterfront column have been filled with "NO", under the assumption that houses with a null value in this column are not on a waterfront. This keeps the 2,376 affected rows in the dataset instead of dropping them, so the information in their other columns is preserved. The quality of the resulting model then depends on how well that assumption holds.
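
Because about 11% of rows (2,376 of 21,597) are affected, it can be worth checking how the imputation shifts the category shares. The sketch below shows one way to do that on a toy series standing in for `housing_df["waterfront"]`. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column standing in for housing_df["waterfront"] (made-up values).
waterfront = pd.Series(["NO", np.nan, "YES", "NO", np.nan, "NO"])

# Share of each category before imputation, counting missing values too.
before = waterfront.value_counts(dropna=False, normalize=True)

# Same imputation as in the notebook: treat missing as "NO".
filled = waterfront.fillna("NO")
after = filled.value_counts(normalize=True)

print(before)
print(after)
```

Comparing the two tables shows how much of the final "NO" share comes from imputed rather than observed values.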

In [10]:
# checking for unique values in the view column

housing_df["view"].unique()

array(['NONE', nan, 'GOOD', 'EXCELLENT', 'AVERAGE', 'FAIR'], dtype=object)

In [11]:
# filling the missing values in the view column with NONE

housing_df["view"] = housing_df["view"].fillna("NONE")

All the missing values in the view column have been filled with "NONE". This avoids dropping the 63 rows where the view value is null, so the information in their other columns is preserved for the model.
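
As an aside, both imputations could also be expressed in a single call by passing a dictionary of per-column defaults to `DataFrame.fillna`. The sketch below uses a small made-up frame, and the names `toy_df` and `defaults` are hypothetical, introduced only for this example.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same two categorical columns as the notebook.
toy_df = pd.DataFrame({
    "waterfront": ["NO", np.nan, "YES"],
    "view": ["NONE", "GOOD", np.nan],
    "floors": [1.0, 2.0, 1.5],
})

# One default per column; columns not listed are left untouched.
defaults = {"waterfront": "NO", "view": "NONE"}
toy_filled = toy_df.fillna(value=defaults)

print(toy_filled)
```

Keeping the defaults in one dictionary makes the imputation assumptions easy to review in one place.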

In [12]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   price        21597 non-null  float64
 1   bedrooms     21597 non-null  int64  
 2   bathrooms    21597 non-null  float64
 3   sqft_living  21597 non-null  int64  
 4   sqft_lot     21597 non-null  int64  
 5   grade        21597 non-null  object 
 6   waterfront   21597 non-null  object 
 7   view         21597 non-null  object 
 8   floors       21597 non-null  float64
dtypes: float64(3), int64(3), object(3)
memory usage: 1.5+ MB


#### (ii) Checking and removing outliers

In [13]:
# checking for skewness in the price column

print(housing_df['price'].skew())

4.023364652271239


In [14]:
# defining the mean and standard deviation

price_std = housing_df.price.std()
price_mean = housing_df.price.mean()
housing_df['price'].describe()

count    2.159700e+04
mean     5.402966e+05
std      3.673681e+05
min      7.800000e+04
25%      3.220000e+05
50%      4.500000e+05
75%      6.450000e+05
max      7.700000e+06
Name: price, dtype: float64

In [15]:
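# Removing price outliers at or above 3 standard deviations over the mean
# (only the upper tail is cut; mean - 3*std is negative, so no price falls below it)
new_housing_df = housing_df.copy()
index = new_housing_df[new_housing_df['price'] >= price_mean + 3 * price_std].index
new_housing_df.drop(index, inplace=True)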
new_housing_df['price'].describe()

count    2.119100e+04
mean     5.070103e+05
std      2.594622e+05
min      7.800000e+04
25%      3.200000e+05
50%      4.470000e+05
75%      6.276500e+05
max      1.640000e+06
Name: price, dtype: float64

In [16]:
# rechecking skewness

print(new_housing_df['price'].skew())

1.3892777400843523


After removing the outliers in the price column, the data is less skewed to the right.
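
The effect of the price filter can be checked by hand from the summary statistics printed above. The short calculation below copies the mean, standard deviation and row counts from those `describe()` outputs and derives the cutoff and the number of rows dropped.

```python
# Values copied from the describe() outputs shown above.
price_mean = 5.402966e+05
price_std = 3.673681e+05
rows_before = 21597
rows_after = 21191

# Upper cutoff used by the filter, and how many rows it removed.
upper_cutoff = price_mean + 3 * price_std
rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(f"upper cutoff: {upper_cutoff:,.0f}")
print(f"rows dropped: {rows_dropped} ({share_dropped:.2%})")
```

The cutoff works out to roughly 1.64 million, which agrees with the new maximum price of 1,640,000, and the filter removes 406 rows, just under 2% of the data.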

In [23]:
# checking for skewness in the sqft lot column

print(new_housing_df["sqft_lot"].skew())

13.201280275411648


In [24]:
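# Removing outliers from sqft_lot
# NOTE: the cutoff below is built from price_mean and price_std (about 1.64 million),
# not from the mean and standard deviation of sqft_lot itself
new_housing_df = new_housing_df.copy()
index = new_housing_df[new_housing_df['sqft_lot'] >= price_mean + 3 * price_std].index
new_housing_df.drop(index, inplace=True)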
new_housing_df['sqft_lot'].describe()

count    2.119000e+04
mean     1.474950e+04
std      3.880606e+04
min      5.200000e+02
25%      5.005250e+03
50%      7.560000e+03
75%      1.049000e+04
max      1.164794e+06
Name: sqft_lot, dtype: float64

In [26]:
# rechecking for skewness

print(new_housing_df["sqft_lot"].skew())

11.363425498371784


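After this step the sqft_lot column is only slightly less skewed to the right (13.20 before, 11.36 after). Only one row was dropped (21,191 to 21,190) because the cutoff was computed from the mean and standard deviation of price rather than those of sqft_lot, so the largest lots remain in the data.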
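
A cutoff based on each column's own mean and standard deviation could be sketched as below. The helper `drop_upper_outliers` is hypothetical (it is not part of the notebook or of pandas) and the lot sizes are made up. It mirrors the upper-tail-only rule used for price.

```python
import pandas as pd

def drop_upper_outliers(df: pd.DataFrame, column: str, n_std: float = 3.0) -> pd.DataFrame:
    """Return a copy of df without rows where `column` is at or above mean + n_std * std."""
    cutoff = df[column].mean() + n_std * df[column].std()
    return df[df[column] < cutoff].copy()

# Made-up lot sizes with one extreme value, for illustration only.
toy_df = pd.DataFrame(
    {"sqft_lot": [5000, 6000, 7000, 8000, 7500] * 4 + [1_000_000]}
)
trimmed = drop_upper_outliers(toy_df, "sqft_lot")

print(len(toy_df), "->", len(trimmed))
```

Applied to `new_housing_df` with `"sqft_lot"`, a helper like this would use a cutoff derived from the lot sizes themselves, so the results would differ from the output shown above.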

In [28]:
print(new_housing_df["floors"].skew())
print(new_housing_df["sqft_living"].skew())
print(new_housing_df["bathrooms"].skew())
print(new_housing_df["bedrooms"].skew())

0.6385327478060837
1.0011866219060426
0.3135782409537157
2.083563232445137


Outlier removal was applied to the two most skewed columns, price and sqft_lot, because outliers can reduce the power and validity of statistical tests and bias the estimates of a regression model. Inspecting outliers can also reveal errors in the data.
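
To decide which columns deserve this treatment, the per-column skewness checks above can be collected into one ranked table. The sketch below does this on a small made-up frame. `toy_df` is a hypothetical stand-in for `new_housing_df`.

```python
import pandas as pd

# Made-up values; "grade" is non-numeric and is skipped by select_dtypes.
toy_df = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 3, 2, 3, 10],
    "floors": [1.0, 2.0, 1.0, 1.5, 2.0, 1.0, 2.0, 1.5],
    "grade": ["7 Average"] * 8,
})

# Skewness of every numeric column, most right-skewed first.
skew_by_column = (
    toy_df.select_dtypes(include="number")
    .skew()
    .sort_values(ascending=False)
)

print(skew_by_column)
```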