<b style="font-size:2em;">Final Project Submission</b>

* Student name: Edward Amor 
* Student pace: part time
* Scheduled project review date/time: 
* Instructor name: Victor Geislinger
* Blog post URL: 

# Business Understanding

## Background

A real estate company in the King County area wants to investigate house sales in order to predict pricing. Our task is to clean, explore, and model the housing data they have supplied in `kc_house_data.csv`, and to identify the house features that significantly affect house price.

## Scope

To identify which house features significantly affect house price, we will use standard EDA practices, hypothesis testing, and regression modeling.

The results of our analysis will be condensed into a presentation for our client.

## Goal

Our goal is to build a multivariate linear regression model that predicts the sale price of houses as accurately as possible.

## Objectives

1. Data Acquisition
2. Data Understanding
  - Data Cleaning
  - Data Exploration
  - Data Visualization
3. Modeling
  - Feature Engineering
    - Feature Transformation
    - Feature Selection
  - Training
  - Model Evaluation

# Data Acquisition

The data for this project can be found in the file `kc_house_data.csv`.

Within this repository is also `data_dictionary.csv` which describes the data features.

# Data Understanding

According to our data dictionary the features of our dataset are as follows:

| column        | description                                                  |
| ------------- | ------------------------------------------------------------ |
| id            | unique identifier for a house                                |
| date          | Date house was sold                                          |
| price         | Price is prediction target                                   |
| bedrooms      | Number of Bedrooms/House                                     |
| bathrooms     | Number of bathrooms/bedrooms                                 |
| sqft_living   | square footage of the home                                   |
| sqft_lot      | square footage of the lot                                    |
| floors        | Total floors (levels) in house                               |
| waterfront    | House which has a view to a waterfront                       |
| view          | Has been viewed                                              |
| condition     | How good the condition is ( Overall )                        |
| grade         | overall grade given to the housing unit                      |
| sqft_above    | square footage of house apart from basement                  |
| sqft_basement | square footage of the basement                               |
| yr_built      | Built Year                                                   |
| yr_renovated  | Year when house was renovated                                |
| zipcode       | zip                                                          |
| lat           | Latitude coordinate                                          |
| long          | Longitude coordinate                                         |
| sqft_living15 | The square footage of interior housing living space for the nearest 15 neighbors |
| sqft_lot15    | The square footage of the land lots of the nearest 15 neighbors |

Of the 21 columns, the most notable is `price`, which will be our target during linear regression.

Another thing to note is that we have location data. This could provide some insight into pricing, and may also be of use when delivering key information to our client.

Some notable things to look into while we explore the data:

1. How do location and price relate?
  - Since we are given location data, it would be insightful to see how geographic position affects pricing.
2. Where are the houses with the poorest grades?
  - It would be interesting if there exists a pattern between geographic position, grade, and pricing.
3. Are waterfront houses in higher demand?
  - We can gauge demand by comparing their prices with those of non-waterfront houses. (Hypothesis test?)
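
One way the waterfront comparison might be framed is as a two-sample test on price. The sketch below computes Welch's t statistic by hand on a small made-up frame; the `toy` DataFrame, its values, and the choice of Welch's test are assumptions for illustration, not results from `kc_house_data.csv`.

```python
import numpy as np
import pandas as pd

# Made-up prices purely for illustration, not taken from the dataset
toy = pd.DataFrame({
    "waterfront": [1, 1, 1, 0, 0, 0, 0, 0],
    "price": [900e3, 1.2e6, 1.5e6, 300e3, 450e3, 500e3, 380e3, 420e3],
})

wf = toy.loc[toy["waterfront"] == 1, "price"]
non_wf = toy.loc[toy["waterfront"] == 0, "price"]

# Welch's t statistic: difference in means over the unpooled standard error
std_err = np.sqrt(wf.var(ddof=1) / len(wf) + non_wf.var(ddof=1) / len(non_wf))
t_stat = (wf.mean() - non_wf.mean()) / std_err
print(f"Welch t statistic: {t_stat:.2f}")
```

If SciPy is available, `scipy.stats.ttest_ind(wf, non_wf, equal_var=False)` reports the same statistic together with a p-value.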

## Data Cleaning

In this section we will inspect our dataset and clean up inconsistencies such as null values, duplicates, and incorrect data types.
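
As a rough sketch of what those checks might look like, the snippet below counts nulls and duplicated rows and lists the data types on a small made-up frame. The `toy` DataFrame and its values are illustrative assumptions, not rows from `kc_house_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [250000.0, 310000.0, 310000.0, np.nan],
    "sqft_basement": ["0.0", "?", "?", "400.0"],
})

null_counts = toy.isna().sum()          # missing values per column
n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row
dtypes = toy.dtypes                     # helps spot numbers stored as strings

print(null_counts)
print("duplicate rows:", n_duplicates)
print(dtypes)
```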

In [None]:
# import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [None]:
# read in dataframe and output dataframe info
df = pd.read_csv("kc_house_data.csv")
df.info()

Our raw dataset contains **21597 records**, and has **21 features**.

The data types found within our dataset are: `float64`, `int64`, `object`.

Next we will inspect our features more closely, especially those with the `object` data type, to make sure each has the correct data type.
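
A quick way to list the columns worth a closer look is to select everything that is not numeric. This is a minimal sketch on a made-up frame; the `toy` DataFrame, its values, and its date format are assumptions.

```python
import pandas as pd

# Toy frame with two string-typed columns (made-up values)
toy = pd.DataFrame({
    "date": ["1/2/2015", "3/4/2015"],
    "price": [250000.0, 410000.0],
    "sqft_basement": ["0.0", "?"],
})

# Anything that is not a numeric dtype is a candidate for conversion
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```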

In [None]:
# Output the first 5 rows
display(df.head(5))

The `date` column should be transformed to the `datetime` data type instead of `object`.

The `sqft_basement` column needs further inspection: the values appear to be `float`, but there may be some `string` values mixed in.

In [None]:
# convert date column from string data type to a datetime
df['date'] = pd.to_datetime(df['date'])

In [None]:
# inspect the values of the sqft_basement column
display(df['sqft_basement'].value_counts())

There is a '?' value, which may mean that the value is unknown. We will replace it with a null value and then convert the column to float.
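
To check that '?' is the only non-numeric placeholder before replacing it, one option is to coerce the column with `pd.to_numeric` and look at which values failed to parse. The sketch below uses a made-up Series named `raw`, which is an assumption standing in for the real column.

```python
import pandas as pd

# Toy column mimicking a numeric feature stored as strings (made-up values)
raw = pd.Series(["0.0", "400.0", "?", "910.0", "?"], name="sqft_basement")

# errors="coerce" turns anything unparseable into NaN
coerced = pd.to_numeric(raw, errors="coerce")

# Values that were present but failed to parse are the placeholders
placeholders = raw[coerced.isna() & raw.notna()].value_counts()
print(placeholders)
```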

In [None]:
# replace ? with np.nan in the sqft_basement column and convert the column to float
df['sqft_basement'] = df['sqft_basement'].replace('?', np.nan).astype('float64')

In [None]:
# redisplay the values in the sqft_basement column
display(df['sqft_basement'].value_counts())

All our columns are now numeric (apart from `date`, which is a `datetime`), so it's time to inspect for any interesting anomalies.

We will start out by graphing our distributions.

In [None]:
# plot a histogram for each feature (7 x 3 grid = 21 subplots)
fig, ax = plt.subplots(7, 3, figsize=(16, 9*3))

for col, axes in zip(df.columns, ax.flatten()):
    # sns.distplot is deprecated; histplot is its histogram replacement
    sns.histplot(df[col].dropna(), ax=axes)

plt.tight_layout()
plt.show()

Columns to disregard when cleaning:

- `id`
- `date`
- `long`
- `lat`
- `zipcode`

From the histograms it looks like we may want to drop the `yr_renovated` column, but we will have to investigate further.

There are quite a few columns which should be transformed into the `categorical` data type.
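
A conversion to the categorical data type could look like the sketch below. The `toy` DataFrame and the choice of `condition` and `grade` as the columns to convert are assumptions for illustration; the actual list would come from inspecting the histograms.

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    "condition": [3, 4, 3, 5],
    "grade": [7, 8, 7, 10],
    "price": [250000.0, 410000.0, 265000.0, 880000.0],
})

# Hypothetical choice of columns to treat as categorical
categorical_cols = ["condition", "grade"]
toy = toy.astype({col: "category" for col in categorical_cols})
print(toy.dtypes)
```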

One thing to notice is that some features, such as `yr_renovated` and `sqft_lot`, have a lot of near-zero values, which needs further investigation.
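
To tell columns dominated by exact zeros apart from ones that are merely right-skewed, the fraction of zeros per column can be computed directly. This is a minimal sketch on a made-up frame; the `toy` values are assumptions.

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    "yr_renovated": [0.0, 0.0, 1991.0, 0.0, 0.0],
    "sqft_basement": [0.0, 400.0, 0.0, 910.0, 0.0],
    "sqft_lot": [5650.0, 7242.0, 10000.0, 5000.0, 8080.0],
})

# Fraction of exact zeros in each column, largest first
zero_share = (toy == 0).mean().sort_values(ascending=False)
print(zero_share)
```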

In [None]:
# inspect the yr_renovated column by plotting a violinplot
fig, ax = plt.subplots(figsize=(16, 9))

sns.violinplot(x='yr_renovated', data=df, width=.5, ax=ax)
plt.show()

In [None]:
# check the value counts for the yr_renovated column
display(df['yr_renovated'].value_counts().head())

An extremely large share of the `yr_renovated` values are 0, meaning no renovation has occurred. Instead of removing the column, we can convert every value other than 0 to 1, which gives us a usable binary feature rather than one fewer feature.
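
One caveat when building such a flag: whether `yr_renovated` contains missing values is not shown here, but if it does, the comparison used for the mask matters. The sketch below uses a made-up Series to contrast a `!= 0` mask with a `> 0` mask; the values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy column: 0 means never renovated, NaN means unknown (made-up values)
yr_renovated = pd.Series([0.0, 1991.0, np.nan, 0.0, 2005.0])

# NaN != 0 evaluates to True, so a "!= 0" mask also flags missing rows
mask_ne = yr_renovated != 0

# "> 0" is False for NaN, so missing rows are left out of the mask
mask_gt = yr_renovated > 0

renovated = yr_renovated.copy()
renovated[mask_gt] = 1
print(renovated.tolist())
```

With `mask_gt`, unknown rows stay null and can be handled explicitly later; with `mask_ne`, they would silently be counted as renovated.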

In [None]:
# transform the values other than 0 to a 1 in yr_renovated, and rename the column
mask = df['yr_renovated'] != 0
df.loc[mask, 'yr_renovated'] = 1
df.rename(columns={'yr_renovated': 'renovated'}, inplace=True)