# House sales prices in King County

A project on exploratory data analysis.

Sebastian Thomas @ neue fische Bootcamp Data Science<br />
(datascience at sebastianthomas dot de)

![house][photo]

[photo]: house.jpg "House"
<div style="text-align: right">(Source: Image by <a href="https://pixabay.com/de/users/Pexels-2286921/">Pexels</a> on <a href="https://pixabay.com/">Pixabay</a>)</div>

# Part 0: Business understanding

## Data origin
The data set was given to me for a project on exploratory data analysis at the neue fische Bootcamp Data Science. It is often used in teaching and can also be found on [Kaggle](https://www.kaggle.com). The data probably originates from https://blue.kingcounty.com/Assessor/eRealProperty/default.aspx, where additional data may be found.

## Original features

The instances represent houses in King County in the State of Washington, in the Pacific Northwest region of the United States of America. The given features are as follows:

 feature           | description                                                   | type
:------------------|:--------------------------------------------------------------|:----------------------
 `'id'`            | unique identifier, equal to parcel number, **used as index**  | discrete (numeric)
 `'date'`          | date when object was sold                                     | date time
 `'price'`         | sale price, **used as prediction target**                     | continuous (numeric)
 `'bedrooms'`      | number of bedrooms in house                                   | discrete (numeric)
 `'bathrooms'`     | number of bathrooms in house                                  | discrete (numeric)
 `'sqft_living'`   | square footage of interior housing living space               | continuous (numeric)
 `'sqft_lot'`      | square footage of land lot                                    | continuous (numeric)
 `'floors'`        | total floors (levels) in house                                | discrete (numeric)
 `'waterfront'`    | indicator whether house lies at a waterfront                  | boolean
 `'view'`          | indicator for views of object                                 | ordered (categorical)
 `'condition'`     | overall condition of object                                   | ordered (categorical)
 `'grade'`         | overall grade given to the housing unit, based on King County grading system | ordered (categorical)
 `'sqft_above'`    | square footage of house apart from basement                   | continuous (numeric)
 `'sqft_basement'` | square footage of basement                                    | continuous (numeric)
 `'yr_built'`      | year when house was built                                     | discrete (numeric)
 `'yr_renovated'`  | year when house was renovated                                 | discrete (numeric)
 `'zipcode'`       | zipcode of area where house is positioned                     | nominal (categorical)
 `'lat'`           | latitude coordinate of object                                 | continuous (numeric)
 `'long'`          | longitude coordinate of object                                | continuous (numeric)
 `'sqft_living15'` | square footage of interior housing living space for the nearest 15 neighbors | continuous (numeric)
 `'sqft_lot15'`    | square footage of land lots of the nearest 15 neighbors       | continuous (numeric)

### Feature `'bathrooms'`

The feature `'bathrooms'` takes float values because a bathroom is counted fractionally, depending on its fixtures:

 value  |    description     | meaning
:------:|:------------------:|:----------------------------------------------------
 `1.00` |     full bath      | having a toilet, a sink, a shower and a tub
 `0.75` | three-quarter-bath | having only a toilet, a sink and (usually) a shower
 `0.50` |     half-bath      | having only a toilet and a sink
 `0.25` |    powder room     | having only a sink (very rare)

<div style="text-align: right">(Source: e.g. <a href="https://en.wikipedia.org/wiki/Bathroom#Variations_and_terminology">Wikipedia</a>)</div>

### Feature `'condition'`

The meaning of the feature `'condition'` is as follows:

 value | description | meaning
:-----:|:-----------:|:--------
  `1`  |    Poor     | Worn out. Repair and overhaul needed on painted surfaces, roofing, plumbing, heating and numerous functional inadequacies. Excessive deferred maintenance and abuse, limited value-in-use, approaching abandonment or major reconstruction; reuse or change in occupancy is imminent. Effective age is near the end of the scale regardless of the actual chronological age.
  `2`  |    Fair     | Badly worn. Much repair needed. Many items need refinishing or overhauling, deferred maintenance obvious, inadequate building utility and systems all shortening the life expectancy and increasing the effective age.
  `3`  |   Average   | Some evidence of deferred maintenance and normal obsolescence with age in that a few minor repairs are needed, along with some refinishing. All major components still functional and contributing toward an extended life expectancy. Effective age and utility is standard for like properties of its class and usage.
  `4`  |    Good     | No obvious maintenance required but neither is everything new. Appearance and utility are above the standard and the overall effective age will be lower than the typical property.
  `5`  |  Very Good  | All items well maintained, many having been overhauled and repaired as they have shown signs of wear, increasing the life expectancy and lowering the effective age with little deterioration or obsolescence evident with a high degree of utility.

<div style="text-align: right">(Source: <a href="https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r">King County Government</a>)</div>

### Feature `'grade'`

The meaning of the feature `'grade'` is as follows:

  value  | meaning
:-------:|:--------
 `1`-`3` | Falls short of minimum building standards. Normally cabin or inferior structure.
   `4`   | Generally older, low quality construction. Does not meet code.
   `5`   | Low construction costs and workmanship. Small, simple design.
   `6`   | Lowest grade currently meeting building code. Low quality materials and simple designs.
   `7`   | Average grade of construction and design. Commonly seen in plats and older sub-divisions.
   `8`   | Just above average in construction and design. Usually better materials in both the exterior and interior finish work.
   `9`   | Better architectural design with extra interior and exterior design and quality.
  `10`   | Homes of this quality generally have high quality features. Finish work is better and more design quality is seen in the floor plans. Generally have a larger square footage.
  `11`   | Custom design and higher quality finish work with added amenities of solid woods, bathroom fixtures and more luxurious options.
  `12`   | Custom design and excellent builders. All materials are of the highest quality and all conveniences are present.
  `13`   | Generally custom designed and built. Mansion level. Large amount of highest quality cabinet work, wood trim, marble, entry ways etc.
  
<div style="text-align: right">(Source: <a href="https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r">King County Government</a>)</div>

# Part 1: Data mining

We import the data set, explore it briefly and drop duplicates.

## Imports

### Modules, classes and functions

In [1]:
# data
import pandas as pd

# custom modules
from modules.ds import data_type_info

We set the display options to show all columns and to format floats with two decimals and a thousands separator.

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)
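
As a quick illustration of the format string used above, the following sketch applies it to a single price taken from the data set. It relies only on Python's built-in string formatting; the variable names are chosen for illustration.

```python
# The same format string that is passed to pandas above.
float_format = '{:,.2f}'.format

# Two decimals and a comma as thousands separator.
formatted_price = float_format(221900.0)
print(formatted_price)  # 221,900.00
```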

### Data

We import our data. The value `''`, which occurs for the features `'waterfront'` and `'yr_renovated'`, and the value `'?'`, which occurs for the feature `'sqft_basement'`, will be imported as `NaN`.

In [3]:
houses = pd.read_csv('data/King_County_House_prices_dataset.csv', index_col='id', na_values=['', '?'])
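
The effect of `na_values` can be sketched on a small in-memory CSV. The three rows below are a hypothetical excerpt made up for illustration, not taken from the real file. Note that pandas already treats empty fields as NA by default, so it is the `'?'` entry that needs to be listed explicitly.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; the real file has many more columns.
csv_text = (
    "id,waterfront,sqft_basement,yr_renovated\n"
    "1,,0.0,0.0\n"
    "2,0.0,?,1991.0\n"
    "3,1.0,400.0,\n"
)

toy = pd.read_csv(io.StringIO(csv_text), index_col='id', na_values=['', '?'])

# Empty fields and '?' both end up as NaN, so the columns stay numeric.
print(toy.isna().sum())
print(toy.dtypes)
```

Without `'?'` in `na_values`, the column `'sqft_basement'` would be read as strings instead of floats.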

## First exploration

We check whether the import worked as expected.

In [4]:
houses

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
7129300520,10/13/2014,221900.00,3,1.00,1180,5650,1.00,,0.00,3,7,1180,0.00,1955,0.00,98178,47.51,-122.26,1340,5650
6414100192,12/9/2014,538000.00,3,2.25,2570,7242,2.00,0.00,0.00,3,7,2170,400.00,1951,1991.00,98125,47.72,-122.32,1690,7639
5631500400,2/25/2015,180000.00,2,1.00,770,10000,1.00,0.00,0.00,3,6,770,0.00,1933,,98028,47.74,-122.23,2720,8062
2487200875,12/9/2014,604000.00,4,3.00,1960,5000,1.00,0.00,0.00,5,7,1050,910.00,1965,0.00,98136,47.52,-122.39,1360,5000
1954400510,2/18/2015,510000.00,3,2.00,1680,8080,1.00,0.00,0.00,3,8,1680,0.00,1987,0.00,98074,47.62,-122.05,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263000018,5/21/2014,360000.00,3,2.50,1530,1131,3.00,0.00,0.00,3,8,1530,0.00,2009,0.00,98103,47.70,-122.35,1530,1509
6600060120,2/23/2015,400000.00,4,2.50,2310,5813,2.00,0.00,0.00,3,8,2310,0.00,2014,0.00,98146,47.51,-122.36,1830,7200
1523300141,6/23/2014,402101.00,2,0.75,1020,1350,2.00,0.00,0.00,3,7,1020,0.00,2009,0.00,98144,47.59,-122.30,1020,2007
291310100,1/16/2015,400000.00,3,2.50,1600,2388,2.00,,0.00,3,8,1600,0.00,2004,0.00,98027,47.53,-122.07,1410,1287


The dataframe `houses` has `21597` rows and `20` columns, i.e. we have `21597` instances, one target (`'price'`) and `19` features.

We explore the current data types, the number of unique values and the number of NA values of all features.

In [5]:
data_type_info(houses)

feature,dtype,n_unique,p_unique,n_na,p_na
date,object,372,0.02,0,0.0
price,float64,3622,0.17,0,0.0
bedrooms,int64,12,0.0,0,0.0
bathrooms,float64,29,0.0,0,0.0
sqft_living,int64,1034,0.05,0,0.0
sqft_lot,int64,9776,0.45,0,0.0
floors,float64,6,0.0,0,0.0
waterfront,float64,2,0.0,2376,0.11
view,float64,5,0.0,63,0.0
condition,int64,5,0.0,0,0.0


The features `'waterfront'`, `'view'`, `'sqft_basement'` and `'yr_renovated'` have NA values.
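
The helper `data_type_info` comes from a custom module that is not shown here. A minimal sketch of what such a helper might compute, assuming its columns mean what their names suggest, could look as follows. The function name `data_type_info_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def data_type_info_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the custom helper: one row per column of df."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_unique': df.nunique(),            # NaN is not counted as a value
        'p_unique': df.nunique() / len(df),  # share of unique values
        'n_na': df.isna().sum(),
        'p_na': df.isna().mean(),            # share of NA values
    })


toy = pd.DataFrame({'waterfront': [0.0, 1.0, None, 0.0], 'grade': [7, 7, 8, 9]})
info = data_type_info_sketch(toy)
print(info)
```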

## Dropping duplicates

We inspect the instances whose index value occurs more than once.

In [6]:
houses[houses.index.duplicated(keep=False)][:30]

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
6021501535,7/25/2014,430000.0,3,1.5,1580,5000,1.0,0.0,0.0,3,8,1290,290.0,1939,0.0,98117,47.69,-122.39,1570,4500
6021501535,12/23/2014,700000.0,3,1.5,1580,5000,1.0,0.0,0.0,3,8,1290,290.0,1939,0.0,98117,47.69,-122.39,1570,4500
4139480200,6/18/2014,1380000.0,4,3.25,4290,12103,1.0,0.0,3.0,3,11,2690,1600.0,1997,0.0,98006,47.55,-122.1,3860,11244
4139480200,12/9/2014,1400000.0,4,3.25,4290,12103,1.0,0.0,3.0,3,11,2690,1600.0,1997,0.0,98006,47.55,-122.1,3860,11244
7520000520,9/5/2014,232000.0,2,1.0,1240,12092,1.0,,0.0,3,6,960,280.0,1922,1984.0,98146,47.5,-122.35,1820,7460
7520000520,3/11/2015,240500.0,2,1.0,1240,12092,1.0,0.0,0.0,3,6,960,280.0,1922,1984.0,98146,47.5,-122.35,1820,7460
3969300030,7/23/2014,165000.0,4,1.0,1000,7134,1.0,0.0,0.0,3,6,1000,0.0,1943,0.0,98178,47.49,-122.24,1020,7138
3969300030,12/29/2014,239900.0,4,1.0,1000,7134,1.0,0.0,0.0,3,6,1000,0.0,1943,,98178,47.49,-122.24,1020,7138
2231500030,10/1/2014,315000.0,4,2.25,2180,10754,1.0,,0.0,5,7,1100,1080.0,1954,0.0,98133,47.77,-122.34,1810,6929
2231500030,3/24/2015,530000.0,4,2.25,2180,10754,1.0,0.0,0.0,5,7,1100,1080.0,1954,0.0,98133,47.77,-122.34,1810,6929


In most cases, the target value (`'price'`) of the last instance in each group of duplicates seems to be higher than that of the earlier instance(s), while the values of the features seem to be equal (apart from missing values). These appear to be repeated sales of the same house. We keep only the last instance of each group, since the recorded features presumably describe the house at its most recent sale and might be wrong for the earlier ones.

In [7]:
houses = houses.loc[~houses.index.duplicated(keep='last')]
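
The two uses of `duplicated` above differ only in the `keep` argument. The following sketch shows the difference on a toy frame built from three rows of the outputs above, reduced to two columns.

```python
import pandas as pd

toy = pd.DataFrame(
    {'date': ['7/25/2014', '12/23/2014', '10/13/2014'],
     'price': [430000.0, 700000.0, 221900.0]},
    index=pd.Index([6021501535, 6021501535, 7129300520], name='id'),
)

# keep=False marks every member of a duplicate group (useful for inspection).
all_duplicates = toy.index.duplicated(keep=False)

# keep='last' marks all but the last member of each group,
# so negating the mask keeps exactly the last row per id.
deduplicated = toy.loc[~toy.index.duplicated(keep='last')]
print(deduplicated)
```

Note that "last" refers to the row order in the file, not to the sale date. In the rows shown above the later sale always comes second, but sorting by a parsed date first would make this assumption explicit.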

## Reordering features

We reorder the features so that the target `'price'` comes first and related features are grouped together.

In [8]:
houses = houses.reindex(columns=['price', 'date', 'sqft_living', 'sqft_above', 'sqft_basement', 'sqft_lot',
                                 'sqft_living15', 'sqft_lot15', 'bedrooms', 'bathrooms', 'floors', 'yr_built',
                                 'yr_renovated', 'lat', 'long', 'zipcode', 'condition', 'grade', 'view', 
                                 'waterfront'])
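
One caveat with `reindex`: a misspelled column name does not raise an error but silently produces a column full of `NaN`. A small sanity check, sketched here on a hypothetical toy frame, can guard against that.

```python
import pandas as pd

toy = pd.DataFrame({'date': ['10/13/2014'], 'price': [221900.0], 'bedrooms': [3]})

new_order = ['price', 'date', 'bedrooms']

# The new order must be a permutation of the existing column names.
assert len(new_order) == len(toy.columns)
assert set(new_order) == set(toy.columns)

reordered = toy.reindex(columns=new_order)

# A typo ('dat') yields an all-NaN column instead of an error.
with_typo = toy.reindex(columns=['price', 'dat'])
print(with_typo)
```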

## Documentation of features

In [9]:
features = pd.DataFrame([
    {'feature': 'date', 'description': 'date when object was sold', 'type': 'date time'},
    {'feature': 'sqft_living', 'description': 'square footage of interior housing living space',
     'type': 'continuous (numeric)'},
    {'feature': 'sqft_above', 'description': 'square footage of house apart from basement',
     'type': 'continuous (numeric)'},
    {'feature': 'sqft_basement', 'description': 'square footage of basement',
     'type': 'continuous (numeric)'},
    {'feature': 'sqft_lot', 'description': 'square footage of land lot', 'type': 'continuous (numeric)'},
    {'feature': 'sqft_living15',
     'description': 'square footage of interior housing living space for the nearest 15 neighbors',
     'type': 'continuous (numeric)'},
    {'feature': 'sqft_lot15', 'description': 'square footage of land lots of the nearest 15 neighbors',
     'type': 'continuous (numeric)'},
    {'feature': 'bedrooms', 'description': 'number of bedrooms in house', 'type': 'discrete (numeric)'},
    {'feature': 'bathrooms', 'description': 'number of bathrooms in house', 'type': 'discrete (numeric)'},
    {'feature': 'floors', 'description': 'total floors (levels) in house', 'type': 'discrete (numeric)'},
    {'feature': 'yr_built', 'description': 'year when house was built', 'type': 'discrete (numeric)'},
    {'feature': 'yr_renovated', 'description': 'year when house was renovated',
     'type': 'discrete (numeric)'},
    {'feature': 'lat', 'description': 'latitude coordinate of object', 'type': 'continuous (numeric)'},
    {'feature': 'long', 'description': 'longitude coordinate of object', 'type': 'continuous (numeric)'},
    {'feature': 'zipcode', 'description': 'zipcode of area where house is positioned',
     'type': 'nominal (categorical)'},
    {'feature': 'condition', 'description': 'overall condition of object', 'type': 'ordered (categorical)'},
    {'feature': 'grade',
     'description': 'overall grade given to the housing unit, based on King County grading system',
     'type': 'ordered (categorical)'},
    {'feature': 'view', 'description': 'indicator for views of object', 'type': 'ordered (categorical)'},
    {'feature': 'waterfront', 'description': 'indicator whether house lies at a waterfront',
     'type': 'boolean'}
]).astype('string').set_index('feature')

## Summary

In [10]:
houses.sample(5, random_state=0)

id,price,date,sqft_living,sqft_above,sqft_basement,sqft_lot,sqft_living15,sqft_lot15,bedrooms,bathrooms,floors,yr_built,yr_renovated,lat,long,zipcode,condition,grade,view,waterfront
3226049530,465000.0,1/22/2015,2010,1290,720.0,7264,1510,7326,5,3.0,1.0,1990,,47.69,-122.33,98103,3,7,0.0,0.0
2877102330,772000.0,5/16/2014,2110,2110,0.0,3750,1700,5000,4,2.5,2.0,2000,0.0,47.68,-122.36,98117,3,8,0.0,0.0
8651443420,280000.0,10/17/2014,1710,1030,680.0,5440,1620,6696,4,2.0,1.0,1976,0.0,47.37,-122.09,98042,5,8,0.0,0.0
9521101520,543000.0,12/12/2014,940,940,0.0,3864,1440,3956,2,1.0,1.0,1918,0.0,47.66,-122.34,98103,4,8,0.0,0.0
1026069134,619000.0,8/25/2014,2560,2560,0.0,43608,3000,54088,3,2.5,2.0,2002,0.0,47.76,-122.03,98077,3,9,0.0,0.0


In [11]:
data_type_info(houses)

feature,dtype,n_unique,p_unique,n_na,p_na
price,float64,3595,0.17,0,0.0
date,object,372,0.02,0,0.0
sqft_living,int64,1034,0.05,0,0.0
sqft_above,int64,942,0.04,0,0.0
sqft_basement,float64,303,0.01,452,0.02
sqft_lot,int64,9776,0.46,0,0.0
sqft_living15,int64,777,0.04,0,0.0
sqft_lot15,int64,8682,0.41,0,0.0
bedrooms,int64,12,0.0,0,0.0
bathrooms,float64,29,0.0,0,0.0


In [12]:
features

feature,description,type
date,date when object was sold,date time
sqft_living,square footage of interior housing living space,continuous (numeric)
sqft_above,square footage of house apart from basement,continuous (numeric)
sqft_basement,square footage of basement,continuous (numeric)
sqft_lot,square footage of land lot,continuous (numeric)
sqft_living15,square footage of interior housing living spac...,continuous (numeric)
sqft_lot15,square footage of land lots of the nearest 15 ...,continuous (numeric)
bedrooms,number of bedrooms in house,discrete (numeric)
bathrooms,number of bathrooms in house,discrete (numeric)
floors,total floors (levels) in house,discrete (numeric)


## Train-test-split

In [13]:
from sklearn.model_selection import train_test_split

(houses_train, houses_test) = train_test_split(houses, random_state=0)
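
Since neither `test_size` nor `train_size` is given, scikit-learn falls back to its default and holds out 25 % of the rows for testing. The following sketch shows this on a hypothetical toy frame with 100 rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({'price': range(100)})

# Without test_size or train_size, 25 % of the rows go to the test set.
toy_train, toy_test = train_test_split(toy, random_state=0)
print(len(toy_train), len(toy_test))  # 75 25

# A fixed random_state makes the split reproducible.
toy_train_again, _ = train_test_split(toy, random_state=0)
print(toy_train.index.equals(toy_train_again.index))
```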

## Save data set

We save the training set, the test set and the feature documentation as pickle files.

In [14]:
houses_train.to_pickle('data/king_county_train_1.pickle')
houses_test.to_pickle('data/king_county_test_1.pickle')
features.to_pickle('data/features_1.pickle')
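
A later part of the project can load these files again with `pd.read_pickle`. The round trip is sketched below on a toy frame written to a temporary directory; the file name `toy.pickle` is made up for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'price': [221900.0, 538000.0]},
                   index=pd.Index([7129300520, 6414100192], name='id'))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pickle')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Unlike CSV, pickle preserves dtypes and the index without re-parsing.
print(restored.equals(toy))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.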