# Flatiron Data Science Project 1: The King County Housing Data Set
### Matthew Freeman -- Work Beginning on 3rd March 2019

#### This project uses data science methods in Python 3 to explore the King County Housing Data Set. The data set contains real house sale prices from King County, WA, U.S.A., along with house details which may or may not have influenced pricing. The investigation uses a multivariate linear regression to try to build a model which can predict house prices from these details, and also attempts to answer the three questions I pose below.

### Question A: What can be inferred about this data set from its exposure to misfitting?

### Question B: Where are the higher valued houses in King County located AND how best can I improve my model with location related data?

### Question C: How much more accurate and reliable can a price prediction be based on a multivariate linear regression rather than the single most correlated variable?

     
#### - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

#### This investigation will roughly follow the OSEMN data science process, with some iteration and backward steps being employed where investigation requires it. 
#### This means that I will carry out the investigation in five steps. First, I shall Obtain the data: gathering whatever is needed from the required sources. Second, I shall Scrub the data set: finding missing or incorrectly labelled data points and preparing the data for the best analysis possible. Third, I shall Explore the data set: looking for patterns and anomalies across statistical distributions and correlations which can inform my investigation strategy. Fourth, I shall Model the data: iterating over different models to settle on the one with the most predictive power and using appropriate methods to check validity. Fifth, and finally, I shall iNterpret the results of my investigation: I shall discuss their predictive reliability and their success in answering the questions.

#### - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

## Step 1: Obtain

##### Aims: Import the libraries and functions to be used throughout the investigation; note the kernel used; apply any plotting settings; and most importantly, store the data set in a dataframe for easy manipulation.

In [2]:
# Note: This code is written in Python 3 using the "learn-env" kernel.
# Import pandas for dataframe usage.
import pandas as pd
# Import matplotlib.pyplot for basic plotting.
import matplotlib.pyplot as plt
# Import seaborn for advanced plotting and plot styling.
import seaborn as sns
# Import numpy for mathematical functions.
import numpy as np
# Import statsmodels for statistics functions.
import statsmodels.formula.api as smf

# Set plotting style and appearance magic.
plt.style.use('ggplot')
%matplotlib inline

In [5]:
# The data set is saved as kc_house_data.csv within this directory. I shall read it into a Pandas dataframe.
kc_df = pd.read_csv('kc_house_data.csv')

## Step 2: Scrub

##### Aims: Look through data for missing values, mislabelled data, poorly captured data points or categories, etc.; seek most appropriate solution; fix data as best possible.

In [13]:
# This data set is small enough in size that I am confident it would be possible to get a good feel for it
# by displaying the first few rows.
kc_df.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/2014 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | NaN | 0.0 | ... | 7 | 1180 | 0.0 | 1955 | 0.0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 12/9/2014 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0.0 | 0.0 | ... | 7 | 2170 | 400.0 | 1951 | 1991.0 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 2/25/2015 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0.0 | 0.0 | ... | 6 | 770 | 0.0 | 1933 | NaN | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 12/9/2014 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0.0 | 0.0 | ... | 7 | 1050 | 910.0 | 1965 | 0.0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 2/18/2015 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0.0 | 0.0 | ... | 8 | 1680 | 0.0 | 1987 | 0.0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [14]:
# Now let's check the column data types and sizes, as well as see all the variables available to us.
kc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
id               21597 non-null int64
date             21597 non-null object
price            21597 non-null float64
bedrooms         21597 non-null int64
bathrooms        21597 non-null float64
sqft_living      21597 non-null int64
sqft_lot         21597 non-null int64
floors           21597 non-null float64
waterfront       19221 non-null float64
view             21534 non-null float64
condition        21597 non-null int64
grade            21597 non-null int64
sqft_above       21597 non-null int64
sqft_basement    21597 non-null object
yr_built         21597 non-null int64
yr_renovated     17755 non-null float64
zipcode          21597 non-null int64
lat              21597 non-null float64
long             21597 non-null float64
sqft_living15    21597 non-null int64
sqft_lot15       21597 non-null int64
dtypes: float64(8), int64(11), object(2)
memory usage: 3.5+ MB


Interesting. There are 21 columns, and I can already see that some are not stored in a suitable data type and that some contain fewer non-null values than others. Let's go through these columns in detail.

id: This variable holds numbers identifying the property sold. As we have no key relating to these properties' id codes, nor any knowledge of the methodology used to assign them, the column is of no use to us and can be removed. We can simply rely on the pandas index of this dataframe to identify rows.
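Before removing the column, it may be worth checking whether any id repeats, since a repeated id could mean the same property appears more than once. The following is a minimal sketch, assuming a toy dataframe built from the five rows displayed above; the names `sample_df`, `repeated_id_count` and `without_id_df` are my own and not part of the data set.

```python
import pandas as pd

# Toy stand-in for kc_df: the id and price values of the five rows shown by head().
sample_df = pd.DataFrame({
    "id": [7129300520, 6414100192, 5631500400, 2487200875, 1954400510],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
})

# Count rows whose id has already appeared earlier in the column.
repeated_id_count = int(sample_df["id"].duplicated().sum())
print("Repeated ids:", repeated_id_count)

# Drop the column into a new dataframe, leaving the original untouched.
without_id_df = sample_df.drop(columns=["id"])
print(without_id_df.columns.tolist())
```

On the real dataframe the same two calls would be applied to `kc_df`; whether any ids repeat there is something only the full data can show.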

date: This variable contains dates entered as text. Dates are most easily handled in Python once they are converted into a proper datetime type, which pandas provides functionality for. There may also be incorrectly written dates in this column, which we must check for and deal with as best as possible.
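One possible way to do the conversion is `pd.to_datetime` with `errors="coerce"`, which turns anything it cannot parse into `NaT` so that bad entries can be counted afterwards. This is a sketch only: the first three strings are taken from the rows displayed above, the fourth is a deliberately invalid value I have made up for illustration, and the month/day/year format is my assumption from how the displayed dates look.

```python
import pandas as pd

# Three dates from the displayed rows plus one hypothetical malformed entry.
raw_dates = pd.Series(["10/13/2014", "12/9/2014", "2/25/2015", "13/45/2014"])

# Parse as month/day/year; unparseable entries become NaT instead of raising an error.
parsed_dates = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Count how many entries failed to parse so they can be inspected by hand.
unparsed_count = int(parsed_dates.isna().sum())
print(parsed_dates)
print("Unparsed dates:", unparsed_count)
```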

price: This variable contains the sale prices of the properties, presumably in USD. They are stored as floats and so are displayed with one decimal place, which is an odd fit when most properties have a price rounded to the nearest hundred or thousand dollars anyway. Once it is better understood whether erroneous data points are included, it would be preferable to round to the nearest dollar, as this would look neater in plots and a single decimal place is no more correct a rounding choice.
This variable is the PREDICTION TARGET and should also be copied into a separate pandas series for later use.
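A rough sketch of both ideas, checking for fractional dollar amounts and copying the target into its own series, might look like the following. It uses the five prices displayed above; the names `sample_df`, `has_cents` and `price_target` are my own, and plain rounding to the nearest dollar is used here as one possible choice.

```python
import pandas as pd

# Toy stand-in for kc_df: the prices of the five rows shown by head().
sample_df = pd.DataFrame({"price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0]})

# Check whether any price has a non-zero fractional part.
has_cents = bool((sample_df["price"] % 1 != 0).any())
print("Any fractional prices:", has_cents)

# Round to whole dollars and keep an independent integer copy as the prediction target.
# The integer cast relies on there being no missing prices, which info() indicates above.
price_target = sample_df["price"].round().astype("int64").copy()
print(price_target)
```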

bedrooms: This variable contains low integers representing the number of bedrooms in the house sold. Thankfully the data type of the series seems appropriate for this. Provided there are no mistakenly high or low numbers, it may need no further scrubbing.

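bathrooms: This variable should mirror the bedrooms variable, but for the number of bathrooms. However, the series is a floating point type and holds more than just integers: the second row displayed above has a value of 2.25. These fractions may be a deliberate way of counting partial bathrooms rather than errors, so the values should be investigated before deciding how to fix the column.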
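Before changing the type, it may help to see which bathroom values are not whole numbers and whether they follow a regular step. The sketch below uses the five values displayed above; the names `bathrooms`, `fractional_values` and `all_quarter_steps` are my own, and the quarter-step check is just one hypothesis to test.

```python
import pandas as pd

# Bathroom counts from the five rows shown by head().
bathrooms = pd.Series([1.0, 2.25, 1.0, 3.0, 2.0])

# Keep only the values with a non-zero fractional part.
fractional_values = bathrooms[bathrooms % 1 != 0]
print(fractional_values.tolist())

# Test the hypothesis that every value is a multiple of 0.25.
all_quarter_steps = bool(((bathrooms * 4) % 1 == 0).all())
print("All multiples of 0.25:", all_quarter_steps)
```

If the full column turned out to follow such a step, converting it to integers would discard information, so the result of this check would influence how the column is treated.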

sqft_living, sqft_lot: These variables are for the square foot area of the living space and lot, respectively. They are recorded as appropriately sized integers so they should probably not require much scrubbing.
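For the columns not yet covered, the gaps and the text-typed numeric column seen in the `info()` output can be examined directly. This is a sketch on a toy dataframe built from the five displayed rows, assuming that the `sqft_basement` entries are stored as strings (its dtype is `object` above); `sample_df`, `missing_counts` and `basement_numeric` are names of my own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for kc_df using three columns from the five displayed rows.
sample_df = pd.DataFrame({
    "waterfront": [np.nan, 0.0, 0.0, 0.0, 0.0],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0, 0.0],
    "sqft_basement": ["0.0", "400.0", "0.0", "910.0", "0.0"],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = sample_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Convert the text column to numbers; anything unparseable would become NaN.
basement_numeric = pd.to_numeric(sample_df["sqft_basement"], errors="coerce")
print(basement_numeric)
```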



In [None]:
#Fix columns 1

In [None]:
#Fix columns 2

In [None]:
#Etc

## Step 3: Explore

##### Aims: Check statistical measures of data set variables, analyse distributions of data set variables, locate anomalies, and consider different uses for variables.

## Step 4: Model

##### Aims: Create a multivariate linear regression model, iterating to find the best variables to use or avoid and the best split ratio to use, and check the model for validity.
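As a rough illustration of the comparison this step is aiming for, the sketch below fits a one-predictor and a two-predictor linear regression on a training split and scores both on a held-out split. Everything in it is an assumption for illustration: the data are synthetic rather than the housing data, the 80/20 split is arbitrary, and plain NumPy least squares stands in for the statsmodels functions imported earlier.

```python
import numpy as np

# Synthetic data: the target depends on two predictors plus noise.
rng = np.random.default_rng(0)
n_rows = 200
x1 = rng.normal(size=n_rows)
x2 = rng.normal(size=n_rows)
y = 3.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n_rows)

# Arbitrary 80/20 train/test split; the rows are already in random order.
n_train = int(0.8 * n_rows)
train, test = slice(0, n_train), slice(n_train, n_rows)


def fit_and_score(features):
    """Fit ordinary least squares on the training rows; return R^2 on the test rows."""
    design = np.column_stack([np.ones(n_rows), features])  # add an intercept column
    coefs, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
    residuals = y[test] - design[test] @ coefs
    total = y[test] - y[test].mean()
    return 1.0 - (residuals @ residuals) / (total @ total)


r2_single = fit_and_score(x1)
r2_multi = fit_and_score(np.column_stack([x1, x2]))
print("Single predictor test R^2:", round(r2_single, 3))
print("Two predictors test R^2:", round(r2_multi, 3))
```

Scoring on rows the model has not seen is what makes the comparison meaningful; with the housing data, the split ratio itself would be one of the things to iterate on.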

## Step 5: Interpret

##### Aims: Discuss predictive power and reliability of model, answer questions, debate level of success in answering questions, and discuss proposals for future improvements to the model.  