# Exploring the Dynamics of Real Estate Market in King County: A Data Science Perspective

## Business Overview

U.S. home prices fluctuate over time, generally trending upward with occasional declines. Several key factors influence these changes:

Key Factors Affecting the Real Estate Market:

Supply and Demand:

Supply: Limited housing supply compared to demand drives prices up. Too many houses available can lower prices.

Demand: High demand, driven by more people looking for homes, pushes prices higher.

Economic Factors:

Employment Levels: More jobs mean more people can afford homes, increasing demand and prices.

GDP Growth: A strong economy boosts confidence and demand for homes.

Inflation: Higher costs of goods can affect the affordability of homes.

Interest Rates:

Low Interest Rates: Cheaper borrowing costs make buying homes more affordable, increasing demand and prices.

High Interest Rates: Expensive borrowing can reduce demand and stabilize or lower prices.

Demographic Shifts:

Household Formation: More new households mean more people looking for homes, increasing demand.

Migration Patterns: People moving in or out of an area can impact the local housing market.

Location-Specific Factors:

Proximity to Employment: Areas close to job centers usually have higher demand and prices.

Quality of Schools: Good schools attract families, raising demand and prices.

Neighborhood Amenities: Access to parks, shopping, and entertainment can make certain areas more desirable.

Understanding these factors helps people make better decisions when buying, selling, or investing in real estate.

## 1 PROJECT ALIGNMENT

### 1.1. Project Scope

Our project aims to equip Nara Real Estate (the stakeholder) with the insights and strategies needed for a successful entry into the King County real estate market. Using data-driven analysis and market intelligence, we will provide actionable recommendations for navigating the complexities of the local market.

### 1.2. Problem Statement

Despite its potential for growth and profitability, the King County real estate market presents Nara Real Estate with significant challenges: the market is dynamic, and many factors influence supply, demand, and pricing. A successful entry requires a thorough understanding of local market dynamics, including how economic conditions, demographic shifts, and location-specific elements shape housing preferences and demand. The company also needs data-driven insights to identify lucrative market segments, optimize pricing strategies, and improve client acquisition and retention. The overarching problem is therefore how to equip Nara Real Estate with the tools, insights, and strategies needed to navigate the King County real estate market, establish a strong presence, and capture market share.

### 1.3. Objectives

Through our data analytics and market insights, we offer Nara Real Estate a strategic advantage by answering the following questions:

**1. House features affecting the prices of houses in King County**

Understanding home buyers' preferences can focus our campaign and help us guide clients in the purchase of their new homes.

**2. Seasonal impact on house sale prices**

Understanding seasonal trends will influence when the campaign should be launched.

**3. Predicting market trends and property values**

Using the dataset provided, we will build a model that predicts market trends in the area and property values.

**4. Locations which have the highest average house prices**

Understanding what locations to focus the advertising campaign on is key for our stakeholders.

We have been provided with a dataset with house sale prices in King County, Washington State, USA from May 2014 to May 2015 to use for this project.

### 1.4. Brief Conclusion

Through our comprehensive analysis and strategic recommendations, we aim to empower Nara Real Estate to make informed decisions and successfully enter the King County real estate market. Our data-driven approach will help them achieve sustainable growth and enhance their penetration of the King County real estate market.

## 2 DATA UNDERSTANDING
### Dataset Description

The data utilized for this project consists of the following dataset:

`data/kc_house_data.csv`: This dataset contains detailed information about individual properties in King County, including attributes such as square footage, number of bedrooms and bathrooms, location, and sale price.

Here are the key columns in the dataset:

`id`: Unique identifier for each house sale.

`date`: Date of the house sale.

`price`: Sale price of the house.

`bedrooms`: Number of bedrooms in the house.

`bathrooms`: Number of bathrooms in the house.

`sqft_living`: Square footage of the living area.

`sqft_lot`: Square footage of the lot.

`floors`: Number of floors in the house.

`waterfront`: Whether the house is on a waterfront (stored as text such as `NO` in this file rather than 0/1; some entries are missing).

`condition`: Overall condition of the house.

`grade`: Overall grade given to the housing unit, based on the King County grading system.

`sqft_above`: Square footage of the house above ground level.

`sqft_basement`: Square footage of the basement.

`yr_built`: Year the house was built.

`yr_renovated`: Year the house was renovated.

`zipcode`: Zip code of the house location.

`lat`: Latitude coordinate of the house.

`long`: Longitude coordinate of the house.

`sqft_living15`: Average square footage of interior housing living space for the nearest 15 neighbors.

`sqft_lot15`: Average square footage of the land lots of the nearest 15 neighbors.

### Relevance of the King County dataset to the stakeholder

The columns in the dataset provide crucial information about various aspects of the houses that could potentially influence their sale prices. Features such as number of bedrooms, bathrooms, square footage, condition, and grade are likely to have a significant impact on home values. We'll use these features to build regression models and identify which characteristics contribute most to home prices.
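
As a rough illustration of the regression idea described above, the sketch below fits ordinary least squares on a handful of made-up rows. The feature values and the "true" coefficients are assumptions invented for the example, not figures from `kc_house_data.csv`, and the notebook's own models may be built differently.

```python
import numpy as np

# Made-up rows: [sqft_living, bedrooms]; not taken from kc_house_data.csv
features = np.array([
    [1000.0, 2.0],
    [1500.0, 3.0],
    [2000.0, 3.0],
    [2500.0, 4.0],
    [1200.0, 3.0],
])

# Hypothetical relationship used only to generate toy prices
true_coef = np.array([50000.0, 200.0, 10000.0])  # intercept, per sqft, per bedroom

# Add an intercept column of ones, then build the toy target
X = np.column_stack([np.ones(len(features)), features])
y = X @ true_coef

# Ordinary least squares via numpy; on noise-free data it recovers the coefficients
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept", "sqft_living", "bedrooms"], coef.round(2))))
```

Raw coefficients are expressed in each feature's own units, so comparing how much different features contribute would normally require putting them on a common scale first.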

## 3 DATA PREPARATION

The approach taken shall involve the following steps:

1. Data Mining
2. Data Cleaning

### 3.1. DATA MINING
We import the libraries needed for the whole analysis and read in the data we shall be using. We display the first 5 rows to get a better understanding of the contents and give a summary of what we observe.

In [13]:
# libraries required for the data analysis

import folium
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objs as go
import plotly.express as px
import reverse_geocoder as rg
import scipy.stats as stats
import statsmodels.api as sm
import seaborn as sns
import warnings

from sklearn.preprocessing import LabelEncoder
from scipy.stats import skew, kurtosis
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

%matplotlib inline
plt.style.use('ggplot')
warnings.filterwarnings('ignore')
pio.renderers.default = 'notebook_connected'

ModuleNotFoundError: No module named 'folium'
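
The error above indicates that `folium` was not installed in the environment where the cell ran. One possible way to check for third-party packages before importing them is sketched below; the list of package names is an assumption drawn from the import cell, and only the standard library is used.

```python
import importlib.util

# Third-party packages named in the import cell (assumed list, adjust as needed)
optional = ["folium", "reverse_geocoder", "plotly"]

# find_spec returns None when a top-level package cannot be located
missing = [name for name in optional if importlib.util.find_spec(name) is None]

if missing:
    print("Install before running the notebook:", ", ".join(missing))
else:
    print("All listed packages are available.")
```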

In [5]:
pwd


'c:\\Users\\Augustine Wanyonyi\\Desktop\\Phase_2_Project_2024\\Phase-2-project\\Work'

**Overview of kc_house_data dataset**

In [6]:
import pandas as pd

# reading the data into the king_county_df
king_county_df=pd.read_csv("kc_house_data.csv")
king_county_df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,6 Low Average,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503


In [7]:
king_county_df["grade"].value_counts()

grade
7 Average        8974
8 Good           6065
9 Better         2615
6 Low Average    2038
10 Very Good     1134
11 Excellent      399
5 Fair            242
12 Luxury          89
4 Low              27
13 Mansion         13
3 Poor              1
Name: count, dtype: int64
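
The `grade` values above combine a number and a label in one string (for example `7 Average`). If a numeric grade is needed for modelling, one possible approach is to split on the first space. This is a sketch on a toy Series that imitates the strings shown above; the variable names are illustrative and it is not necessarily the step taken later in the notebook.

```python
import pandas as pd

# Toy sample imitating the `grade` strings shown above (not the real column)
grades = pd.Series(["7 Average", "8 Good", "6 Low Average", "10 Very Good", "13 Mansion"])

# Split once on the first space: the leading token is the numeric grade
parts = grades.str.split(" ", n=1)
grade_num = parts.str[0].astype(int)
grade_label = parts.str[1]

print(grade_num.tolist())
print(grade_label.tolist())
```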

In [8]:
#Looking at the info printout
king_county_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

In [9]:
#conversion of date to dtype datetime to confirm timeframe of dataset
date_df=pd.to_datetime(king_county_df["date"])
date_df

0       2014-10-13
1       2014-12-09
2       2015-02-25
3       2014-12-09
4       2015-02-18
           ...    
21592   2014-05-21
21593   2015-02-23
21594   2014-06-23
21595   2015-01-16
21596   2014-10-15
Name: date, Length: 21597, dtype: datetime64[ns]

In [10]:
# Timestamp of the dataframe
date_df.min(),date_df.max()

(Timestamp('2014-05-02 00:00:00'), Timestamp('2015-05-27 00:00:00'))
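
With the dates parsed, the seasonal question in objective 2 could be approached by grouping sales by month. The sketch below uses only the five rows from the preview above, so its numbers say nothing about the real seasonal pattern; the `sample` and `monthly` names are illustrative.

```python
import pandas as pd

# The five preview rows only; far too few to draw seasonal conclusions
sample = pd.DataFrame({
    "date": ["10/13/2014", "12/9/2014", "2/25/2015", "12/9/2014", "2/18/2015"],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
})

# Parse month/day/year strings and extract the calendar month
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")
sample["month"] = sample["date"].dt.month

# Number of sales and median price per month
monthly = sample.groupby("month")["price"].agg(["count", "median"])
print(monthly)
```

The median is used here because house prices tend to be right-skewed; the mean would be an equally valid choice depending on the question.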

Cursory Observation:
1. The dataframe has 21,597 entries; `waterfront`, `view` and `yr_renovated` contain null entries.
2. Data types include int64, float64 and object; the object columns (for example `sqft_basement`, which holds numeric values) will require further analysis.
3. There are 21 columns in the dataset; further analysis will determine whether all of them are used.
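
The observations on null entries and object-typed columns can be quantified in a couple of lines. The sketch below runs on a toy frame, not the real data: the `YES` value and the `?` placeholder are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame imitating three of the columns; "YES" and "?" are assumed values
toy = pd.DataFrame({
    "waterfront": ["NO", np.nan, "NO", "YES"],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
    "sqft_basement": ["0.0", "400.0", "?", "910.0"],
})

# Percentage of missing values per column
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)

# Coerce an object column to numeric; entries that cannot be parsed become NaN
basement = pd.to_numeric(toy["sqft_basement"], errors="coerce")
print(basement)
```

Coercing with `errors="coerce"` turns hidden placeholders into explicit missing values, so they show up in the same null counts as the other columns.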

### 3.2. DATA CLEANING
Data cleaning shall involve the following steps:
1. Check for and resolve duplicate values
2. Check for and resolve null values
3. Check for and resolve extraneous values
4. Perform further cleaning as needed

In [11]:
#FUNCTIONS TO BE USED DURING DATA CLEANING

# Function to get all rows that are exact duplicates of another row
def get_duplicates(df):
    return df[df.duplicated(keep=False)]

# Function to get extraneous values i.e. values that look like placeholders or are exaggerated values
def extraneous_values(df):
    for col in df.columns:
        print(col, '\n', df[col].value_counts(normalize=True), '\n')

# Function to calculate percentage of missing data in a column
def missing_data(df, column):
    length_of_df = len(df)                                                      # number of rows in the dataframe
    missing_count = column.isna().sum()                                         # number of missing values in the given column
    percentage_of_missing_data = round(missing_count / length_of_df * 100, 2)   # percentage of missing values in that column
    print(f"Percentage of Missing Data: {percentage_of_missing_data}%")


**Check for and resolve duplicate values in king_county_df**

In [14]:
#checking for duplicates in king_county_df
get_duplicates(king_county_df)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


In [15]:
#checking unique identifiers for houses
king_county_df["id"].value_counts()

id
795000620     3
2231500030    2
2767602141    2
6632900574    2
7853420110    2
             ..
8091400200    1
3814700200    1
1202000200    1
1794500383    1
2008000270    1
Name: count, Length: 21420, dtype: int64

In [16]:
# exploration of ID unique identifier 795000620
king_county_df[king_county_df["id"]==795000620]

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
17588,795000620,9/24/2014,115000.0,3,1.0,1080,6250,1.0,NO,NONE,...,5 Fair,1080,0.0,1950,0.0,98168,47.5045,-122.33,1070,6250
17589,795000620,12/15/2014,124000.0,3,1.0,1080,6250,1.0,NO,NONE,...,5 Fair,1080,0.0,1950,0.0,98168,47.5045,-122.33,1070,6250
17590,795000620,3/11/2015,157000.0,3,1.0,1080,6250,1.0,,NONE,...,5 Fair,1080,0.0,1950,,98168,47.5045,-122.33,1070,6250


In [17]:
# difference between total entries and unique house identifiers
multiple_times_sold=len(king_county_df["id"])-len(king_county_df["id"].value_counts())
print("No of repeat sale records (entries beyond a house's first sale): ", multiple_times_sold)

No of repeat sale records (entries beyond a house's first sale):  177


Cursory Observations

The `id` column has 21,420 unique values, fewer than the 21,597 entries seen in the info printout. This means some identifiers repeat, yet no two rows are identical: as the example above shows, a repeated `id` marks separate sales of the same house on different dates.
The difference of 177 counts repeat sale records rather than houses. Because one house (id 795000620) was sold three times and every other repeated id appears twice, 176 houses were sold more than once between May 2014 and May 2015. We can hence conclude that there are no duplicate records, and the repeat sales are consistent with house flipping. A report by the ATTOM team (reference link in appendix) shows steady growth in house flipping in the US.
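
To keep the two counts apart (repeat sale records versus distinct houses sold more than once), a check along the following lines may help. It is a minimal sketch on a made-up table; the `sales` frame and its values are hypothetical.

```python
import pandas as pd

# Toy sales table: id 1 sold three times, id 2 twice, id 3 once (made-up values)
sales = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "price": [115000.0, 124000.0, 157000.0, 300000.0, 310000.0, 450000.0],
})

sales_per_house = sales["id"].value_counts()

# Rows beyond the first sale of each house (total rows minus unique ids)
repeat_records = len(sales) - sales["id"].nunique()

# Distinct houses that appear more than once
houses_resold = int((sales_per_house > 1).sum())

print("repeat sale records:", repeat_records)
print("houses sold more than once:", houses_resold)
```

The two numbers differ whenever a house is sold three or more times, which is the case for one `id` in this dataset.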