# King County House Sales Analysis: Decoding Real Estate Dynamics

## Project Overview

Dive into the heart of King County's real estate market, from Seattle's bustling urban centers to its serene suburban landscapes. Our mission: harness the power of regression modeling to unravel the intricate factors influencing house prices, transforming raw data into actionable insights for industry professionals and homeowners alike.

## Business Understanding

### The Challenge
In a market where factors from square footage to school districts can dramatically sway property values, understanding the drivers of home prices is paramount. We aim to develop a predictive model that combines the nuanced understanding of a seasoned realtor with the analytical precision of data science.

### Stakeholder Impact
- **Realtors**: Empower with data-driven pricing strategies
- **Homebuyers**: Guide towards informed investment decisions
- **Sellers**: Illuminate the true market value of properties
- **Developers**: Identify lucrative development opportunities
- **Urban Planners**: Inform future neighborhood development strategies

### Expected Outcomes
- Highly accurate house price predictions
- Identification of key value-influencing features
- Actionable insights for all real estate market participants

## Dataset Understanding

### Data Source
We'll be analyzing the King County House Sales dataset (kc_house_data.csv), a comprehensive repository of real estate transactions.

### Key Features
Our dataset includes 21 columns for each house sale (an identifier, the target `price`, and 19 candidate predictors):

1. `id`: Unique identifier for each property
2. `date`: Date of the house sale
3. `price`: Sale price (our target variable)
4. `bedrooms`: Number of bedrooms
5. `bathrooms`: Number of bathrooms
6. `sqft_living`: Square footage of living space
7. `sqft_lot`: Square footage of the lot
8. `floors`: Number of floors
9. `waterfront`: Whether the property is on a waterfront
10. `view`: Quality of view from the property
11. `condition`: Overall condition of the house
12. `grade`: Overall grade given to the housing unit
13. `sqft_above`: Square footage above ground level
14. `sqft_basement`: Square footage of the basement
15. `yr_built`: Year the house was built
16. `yr_renovated`: Year of the house's last renovation
17. `zipcode`: ZIP code of the area
18. `lat`: Latitude coordinate
19. `long`: Longitude coordinate
20. `sqft_living15`: Square footage of interior living space for the nearest 15 neighbors
21. `sqft_lot15`: Square footage of the land lots of the nearest 15 neighbors

### Data Considerations
- Some features (e.g., `view`, `condition`, `grade`) may require interpretation or additional research to understand their exact meanings and scales.
- We'll need to carefully consider how to handle potential outliers and missing values, especially in fields like `yr_renovated`.
- Per the data dictionary commonly distributed with this dataset, the `15` suffix refers to the 15 nearest neighbors, not to the year 2015. Having both `sqft_living` and `sqft_living15` therefore lets us compare each house with its immediate neighborhood, for example by checking whether a home is larger or smaller than the homes around it.
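
A first step toward interpreting the rating-style columns could be to list their distinct levels and, for `grade`, to separate the leading number from its label. The sketch below runs on a tiny hand-made frame that mimics values visible in the notebook output (such as `7 Average`); the frame `sample` and the column `grade_num` are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Hypothetical mini-sample mimicking values seen in the notebook output.
sample = pd.DataFrame({
    "view": ["NONE", "NONE", "NONE"],
    "condition": ["Average", "Average", "Average"],
    "grade": ["7 Average", "6 Low Average", "8 Good"],
})

# List the distinct levels of each rating-style column.
for col in ["view", "condition", "grade"]:
    print(col, sorted(sample[col].unique()))

# Split "7 Average" into a numeric score; the text label is dropped.
sample["grade_num"] = sample["grade"].str.split(" ", n=1).str[0].astype(int)
print(sample[["grade", "grade_num"]])
```

On the real data, the same loop would be run on `df` to reveal the full set of levels before deciding on an encoding.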

## Our Approach: From Data Points to Market Insights

1. **Exploratory Data Analysis**: Uncover patterns and relationships within our 21 features
2. **Feature Engineering**: Craft powerful predictors, possibly combining or deriving new features from our existing set
3. **Model Development**: Build and refine our regression model, carefully selecting the most impactful features
4. **Insight Extraction**: Translate model outputs into clear, actionable insights
5. **Strategic Recommendations**: Develop targeted advice for each stakeholder group based on our findings
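
Steps 2 and 3 above can be sketched roughly as follows. This is only an illustration on synthetic stand-in data, not the project's actual model: the frame `demo`, the derived `age` column, the 2015 reference year, and the coefficients used to generate prices are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset): price depends on size and age.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n),
    "yr_built": rng.integers(1900, 2015, n),
})
demo["price"] = (
    150 * demo["sqft_living"]
    - 500 * (2015 - demo["yr_built"])
    + rng.normal(0, 20_000, n)
)

# Feature engineering: derive house age (2015 is an assumed reference year).
demo["age"] = 2015 - demo["yr_built"]

# Model development: hold out a test set, fit, and score on unseen rows.
X = demo[["sqft_living", "age"]]
y = demo["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"Held-out R^2: {test_r2:.3f}")
```

Scoring on held-out rows rather than on the training data gives a more honest picture of how the model might behave on sales it has not seen.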

By decoding the intricacies of King County's housing market through these 21 key features, we're not just predicting prices – we're providing a comprehensive roadmap for navigating one of America's most dynamic real estate markets. Let's transform this rich dataset into a vivid story of market trends and opportunities!

## Data Mining & Preparation

In [2]:

# Standard library
import warnings

# Data handling and visualization
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import plotly.io as pio
import seaborn as sns

# Statistics and modeling
import scipy.stats as stats
import statsmodels.api as sm
from scipy.stats import kurtosis, skew
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

%matplotlib inline
plt.style.use('ggplot')

In [6]:
# Dataset overview: load the CSV and preview the first five rows
df = pd.read_csv("kc_house_data.csv")
df.head()


|   | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/2014 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | NaN | NONE | ... | 7 Average | 1180 | 0.0 | 1955 | 0.0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 12/9/2014 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | NO | NONE | ... | 7 Average | 2170 | 400.0 | 1951 | 1991.0 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 2/25/2015 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | NO | NONE | ... | 6 Low Average | 770 | 0.0 | 1933 | NaN | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 12/9/2014 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | NO | NONE | ... | 7 Average | 1050 | 910.0 | 1965 | 0.0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 2/18/2015 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | NO | NONE | ... | 8 Good | 1680 | 0.0 | 1987 | 0.0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
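
The `date` values in the preview are month/day/year strings, so any time-based analysis would first need them converted to datetimes. A minimal sketch, assuming that format holds for every row; the frame `preview` and the column `sale_year` are illustrative names, and the two rows are copied from the preview above:

```python
import io

import pandas as pd

# Two rows copied from the preview, restricted to a few columns.
csv_text = """id,date,price
7129300520,10/13/2014,221900.0
6414100192,12/9/2014,538000.0
"""

preview = pd.read_csv(io.StringIO(csv_text))

# Convert the month/day/year strings into datetime64 values.
preview["date"] = pd.to_datetime(preview["date"], format="%m/%d/%Y")

# Derive the sale year, which could later serve as a model feature.
preview["sale_year"] = preview["date"].dt.year
print(preview.dtypes)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates.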


In [9]:
df.columns


Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

The dataset has 21 columns.

In [11]:
# Check the shape of the dataset
df.shape

(21597, 21)

The dataset contains 21,597 rows and 21 columns.

In [18]:
df.value_counts()


id          date       price     bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  condition  grade      sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat      long      sqft_living15  sqft_lot15
1000102     4/22/2015  300000.0  6         3.00       2400         9373      2.0     NO          NONE  Average    7 Average  2400        0.0            1991      0.0           98002    47.3262  -122.214  2060           7316          1
6362900080  8/14/2014  525000.0  6         3.00       2880         7560      2.0     NO          NONE  Average    7 Average  2880        0.0            1980      0.0           98144    47.5959  -122.300  1470           1815          1
6362900172  9/23/2014  499950.0  3         3.50       1820         1991      2.0     NO          NONE  Average    8 Good     1430        390.0          2014      0.0           98144    47.5960  -122.298  1550           1460          1
6365900065  7/18/2014  334850.0  2         1.00       870        
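
Called on the whole frame, `value_counts()` counts distinct rows (and, in recent pandas versions, skips rows containing missing values by default), which is why each row shown above has a count of 1. Counting values per column, or checking `id` for repeats, is often more informative. A sketch on a hypothetical mini-frame named `toy`, whose values are made up for illustration:

```python
import pandas as pd

# Hypothetical mini-frame; the values are illustrative only.
toy = pd.DataFrame({
    "id": [1000102, 6362900080, 1000102],
    "waterfront": ["NO", None, "YES"],
})

# Per-column frequency table, keeping missing values visible.
waterfront_counts = toy["waterfront"].value_counts(dropna=False)

# Number of rows whose id already appeared in an earlier row.
repeat_ids = int(toy["id"].duplicated().sum())

print(waterfront_counts)
print("repeated ids:", repeat_ids)
```

A non-zero repeat count on the real data would suggest that some properties appear more than once, which is worth checking before modeling.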

In [19]:
df.isna().sum()

id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

The `waterfront` column has 2,376 null values, `yr_renovated` has 3,842, and `view` has 63.
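
One possible way to handle these gaps is to fill each column with a default meaning "feature absent". The defaults below (`"NO"`, `"NONE"`, and `0.0`) are assumptions based on the values visible in the preview, not a decision taken from the original analysis, and `toy` is a hypothetical mini-frame:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with the same three columns that contain nulls.
toy = pd.DataFrame({
    "waterfront": [np.nan, "NO", "YES"],
    "view": ["NONE", np.nan, "NONE"],
    "yr_renovated": [0.0, 1991.0, np.nan],
})

# Assumed defaults: a missing entry means no waterfront, no view, never renovated.
fill_values = {"waterfront": "NO", "view": "NONE", "yr_renovated": 0.0}
cleaned = toy.fillna(value=fill_values)

print(cleaned.isna().sum())
```

Filling with the most common level keeps every row, at the cost of possibly understating how many homes truly have the feature; dropping the affected rows is the main alternative.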