# Predict House Prices in King County, Washington  

King County is the most populous county in Washington State and the 12th most populous in the United States. Seattle, the most populous city in Washington, is located in King County. My goal is to predict the price of a house based on houses sold between May 2014 and May 2015 in King County, Washington, USA, and to create a model that home buyers, home sellers, and online property listings can use to determine the price of a home.

## Table of Contents:
* [Data Collection](#DataCollection)
* [Data Organization](#DataOrganization)
* [Data Definition](#DataDefinition)
* [Data Cleaning](#DataCleaning)

In [2]:
#import libraries
import pandas as pd

# 1. Data Collection <a class="anchor" id="DataCollection"></a>

In [3]:
#get the data and read it 
housingData = pd.read_csv('../Data/kc_house_data.csv')

# 2. Data Organization <a class="anchor" id="DataOrganization"></a>

In [4]:
#View The Data
housingData.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180.0 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170.0 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770.0 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050.0 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680.0 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [5]:
# View Summary of Data
housingData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21611 non-null  float64
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

In [6]:
#Check the number of rows in the dataframe
len(housingData.index)

21613

# 3. Data Definition <a class="anchor" id="DataDefinition"></a>

This is a list of the column names with their descriptions and data types:


<b>id:</b> A notation for a house - Numeric

<b>date:</b> Date the house was sold - String

<b>price:</b> Sale price; this is the prediction target - Numeric

<b>bedrooms:</b> Number of bedrooms in the house - Numeric

<b>bathrooms:</b> Number of bathrooms in the house; fractional values such as 2.25 occur in the data - Numeric

<b>sqft_living:</b> Square footage of the home - Numeric

<b>sqft_lot:</b> Square footage of the lot - Numeric

<b>floors:</b> Total floors (levels) in the house - Numeric

<b>waterfront:</b> Whether the house has a view of a waterfront - Numeric

<b>view:</b> Has been viewed - Numeric

<b>condition:</b> Overall condition of the house: 1 indicates a worn-out property and 5 an excellent one (http://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r#g) - Numeric

<b>grade:</b> Overall grade given to the housing unit, based on the King County grading system: 1 is poor, 13 is excellent - Numeric

<b>sqft_above:</b> Square footage of the house apart from the basement - Numeric

<b>sqft_basement:</b> Square footage of the basement - Numeric

<b>yr_built:</b> Year the house was built - Numeric

<b>yr_renovated:</b> Year the house was renovated - Numeric

<b>zipcode:</b> ZIP code - Numeric

<b>lat:</b> Latitude coordinate - Numeric

<b>long:</b> Longitude coordinate - Numeric

<b>sqft_living15:</b> Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area - Numeric

<b>sqft_lot15:</b> Lot size area in 2015 (implies some renovations) - Numeric
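Because `date` is stored as a string, it cannot be used directly for time-based analysis. The sketch below shows one way to parse it, using a small hand-built sample that mirrors the values shown in `housingData.head()`. The format string is inferred from those sample values, and the `sale_year` and `sale_month` columns are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Small sample mirroring the `date` strings shown in housingData.head()
sample = pd.DataFrame(
    {"date": ["20141013T000000", "20141209T000000", "20150225T000000"]}
)

# Parse the compact timestamp strings into datetime values.
# Assumption: every value follows the YYYYMMDDTHHMMSS layout seen in the sample.
sample["date"] = pd.to_datetime(sample["date"], format="%Y%m%dT%H%M%S")

# Hypothetical derived columns, added only to show what the parsed type enables
sample["sale_year"] = sample["date"].dt.year
sample["sale_month"] = sample["date"].dt.month

print(sample.dtypes)
print(sample)
```

The same `pd.to_datetime` call could be applied to `housingData['date']` if the whole column follows this layout.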

In [5]:
# Check the data types of the columns
housingData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21611 non-null  float64
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

# 4. Data Cleaning <a class="anchor" id="DataCleaning"></a>

In [7]:
#check for duplicates: duplicated() flags rows that repeat an earlier row
duplicate = housingData[housingData.duplicated()]
print(duplicate)

Empty DataFrame
Columns: [id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15]
Index: []

[0 rows x 21 columns]


There are no duplicates. If there were, I could use `drop_duplicates()` to remove the duplicated rows.
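As a minimal sketch of how duplicate removal would look, the example below uses a hypothetical toy DataFrame (`toy`, not part of the housing data) that contains one fully duplicated row.

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row (the real data has none)
toy = pd.DataFrame(
    {"id": [1, 2, 2, 3], "price": [100.0, 200.0, 200.0, 300.0]}
)

# duplicated() marks each repeat of an earlier row as True
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```

By default both methods compare all columns; passing `subset=` restricts the comparison to selected columns.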

In [27]:
#check for any null values
housingData.isnull().values.any()

#check how many null values
housingData.isnull().sum()

# The 2 null values are in the sqft_above column, at index 10 and 17
housingData[housingData['sqft_above'].isna()]

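| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1736800520 | 20150403T000000 | 662500.0 | 3 | 2.5 | 3560 | 9796 | 1.0 | 0 | 0 | ... | 8 | NaN | 1700 | 1965 | 0 | 98007 | 47.6007 | -122.145 | 2210 | 8925 |
| 17 | 6865200140 | 20140529T000000 | 485000.0 | 4 | 1.0 | 1600 | 4300 | 1.5 | 0 | 0 | ... | 7 | NaN | 0 | 1916 | 0 | 98103 | 47.6648 | -122.343 | 1610 | 4300 |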


In [28]:
#Remove the rows where sqft_above is NA (keep only rows where it is not NA)
newHousingDF = housingData[housingData['sqft_above'].notna()]

newHousingDF.isnull().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64
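Dropping the two rows is one option. An alternative, sketched below, is to fill the missing values instead. It rests on an assumption that is not confirmed for the full dataset: that `sqft_above` equals `sqft_living` minus `sqft_basement`. That relationship holds for the rows displayed earlier in this notebook, so the sketch checks it on the known rows before filling. The `sample` frame is hand-built from those displayed rows.

```python
import numpy as np
import pandas as pd

# Hand-built sample from rows shown earlier: index 0 and 1 from head(),
# plus the two rows (index 10 and 17) where sqft_above is missing
sample = pd.DataFrame(
    {
        "sqft_living": [1180, 2570, 3560, 1600],
        "sqft_basement": [0, 400, 1700, 0],
        "sqft_above": [1180.0, 2170.0, np.nan, np.nan],
    },
    index=[0, 1, 10, 17],
)

# Assumption: sqft_above = sqft_living - sqft_basement
estimate = sample["sqft_living"] - sample["sqft_basement"]

# Check the assumption on the rows where sqft_above is known
known = sample["sqft_above"].notna()
assumption_holds = bool((estimate[known] == sample.loc[known, "sqft_above"]).all())

# Fill only the missing values, and only if the assumption held on known rows
if assumption_holds:
    sample["sqft_above"] = sample["sqft_above"].fillna(estimate)

print(sample)
```

On the real data, the check on known rows would need to pass for all 21611 non-null rows before relying on this fill; otherwise dropping the two rows, as done above, is the safer choice.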