# Data exploration of House Price Predictions


The data is downloaded from Kaggle and describes houses using roughly 80 variables. The goal is to use these explanatory variables to predict the house prices, which makes this a regression problem.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

This notebook explores the data to understand the basic relationships between the variables and to get a feeling for which variables might be good predictors of the house prices. A separate notebook will contain the statistical and machine learning models for the predictions.

Author: Julia Hammerer, Vanessa Mai
Last Changes: 18.11.2018

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-exploration-of-House-Price-Predictions" data-toc-modified-id="Data-exploration-of-House-Price-Predictions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data exploration of House Price Predictions</a></span><ul class="toc-item"><li><span><a href="#Data-Profile" data-toc-modified-id="Data-Profile-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data Profile</a></span></li><li><span><a href="#Missing-values" data-toc-modified-id="Missing-values-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Missing values</a></span></li><li><span><a href="#Sanity-checks" data-toc-modified-id="Sanity-checks-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Sanity checks</a></span></li><li><span><a href="#Cleanse-data" data-toc-modified-id="Cleanse-data-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Cleanse data</a></span></li><li><span><a href="#Statistics" data-toc-modified-id="Statistics-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Statistics</a></span></li></ul></li></ul></div>

In [None]:
import sys
sys.path.insert(0, '../helper/')

In [None]:
# load packages
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import pandas_profiling
import missingno as msno
import matplotlib.pyplot as plt
import plotly.graph_objs as go

# pandas.tools.plotting was removed from pandas; table now lives in pandas.plotting
from pandas.plotting import table
from plotly.offline import init_notebook_mode
from plotly.offline import iplot
from plotly.offline import plot
from scipy.stats import mannwhitneyu
from statsmodels.distributions.empirical_distribution import ECDF
from scipy import stats
from scipy.stats import pearsonr

from helper import na_ratio_table

In [None]:
# load data
# this is part of a Kaggle competition, so there are two files;
# only the training set contains the target variable,
# so we use it for the whole analysis

df = pd.read_csv("../data/house_prices_train.csv")

In [None]:
print("Number of records and variables: ",df.shape)

## Data Profile

In [None]:
# for a first overview, we apply the pandas-profile report
# it provides simple histograms, distributions, missingness 
# and correlations for all variables

pandas_profiling.ProfileReport(df)

A description of all data fields can be found on the Kaggle site: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. Most of them are self-explanatory though.

Around half of the variables are categorical and the other half are numerical. The categorical variables will need one-hot encoding before they can be incorporated into the prediction models.
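
As a rough illustration of what one-hot encoding does, the sketch below encodes a single categorical column of a small toy frame. The values are made up for this example, and `pd.get_dummies` is only one possible way to do it; the modelling notebook may well use a different encoder.

```python
import pandas as pd

# Toy data for illustration only: the values are made up,
# the column names follow the Kaggle data set.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl"],
    "GrLivArea": [1710, 1262, 1786],
})

# One indicator column per category level;
# numerical columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Street"])
print(encoded)
```

Each level of `Street` becomes its own indicator column, so a variable with many levels adds many columns to the model matrix.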

We can already spot some correlations that look promising. Some of them
are expected and won't give us further insights. We are particularly
interested in correlations with our target variable:
- OverallQual - SalePrice
- GrLivArea - SalePrice
- FullBath - SalePrice
- GarageYrBlt - YearRemodAdd
- LotFrontage - LotArea
- TotRmsAbvGrd - GrLivArea
- BsmtUnfSF - BsmtFinSF1: negative correlation
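
The pairs listed above come from the profile report. A minimal sketch of how the correlations with the target could be ranked numerically is shown here; it uses a toy frame with made-up values, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [1000, 1500, 1400, 2000, 2600],
    "BsmtUnfSF": [900, 300, 700, 400, 500],
    "SalePrice": [120000, 160000, 180000, 250000, 330000],
})

# Pearson correlation of every variable with the target,
# ordered by absolute strength so negative correlations are not buried.
target_corr = (
    toy.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

On the real data the same expression could be applied to the numerical columns of `df`.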

We can also detect variables that probably won't be of much use, e.g.
- Street: only two values, one of which is extremely rare.
- Utilities: almost constant, with two values, one of which occurs in only one record.
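
One way to flag such near-constant variables systematically is to look at the share of the most frequent value per column. The sketch below does this on a toy frame with made-up values; the 0.95 threshold is an arbitrary assumption for the example, not a value taken from the analysis.

```python
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],
    "MSZoning": ["RL"] * 60 + ["RM"] * 40,
})

# Share of the most frequent value in each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Columns dominated by a single value; the threshold is an assumption.
THRESHOLD = 0.95
near_constant = top_share[top_share > THRESHOLD].index.tolist()
print(near_constant)
```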

## Missing values
Let's check the missingness in more detail.

In [None]:
# compute the NA table once and keep only the variables with missing values
na_table = na_ratio_table(df).loc[lambda t: t["NA_COUNT"] > 0]
display(na_table)
display(na_table.shape)


We have 19 variables that contain missing values. For most of them, a missing value simply means that the feature is not available for that property. For a few, however, it can indicate a data quality issue:
- Electrical: the type is not stated, and it is improbable that there is no electrical system at all.
- LotFrontage: a building should always have a lot frontage.

Since only one record is missing for "Electrical", we can simply filter it out or even ignore it. For LotFrontage we can apply imputation techniques if necessary.
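
As a sketch of what such an imputation could look like, the example below fills missing LotFrontage values with the median of the same neighbourhood. Grouping by `Neighborhood` and using the median are assumptions made for illustration, not a method chosen in this notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "CollgCr", "CollgCr"],
    "LotFrontage": [70.0, np.nan, 80.0, 65.0, np.nan],
})

# Median lot frontage of each neighbourhood, aligned to the original rows.
neighborhood_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")

# Fill only the missing entries; observed values stay untouched.
toy["LotFrontage_imputed"] = toy["LotFrontage"].fillna(neighborhood_median)
print(toy)
```

Writing the result to a new column keeps the original values available for comparison.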

In [None]:
# we test if the data is randomly missing, or if there are some patterns in the missingness
# this helps us indicate whether there are data quality issues or if the missingness is part of the data
msno.heatmap(df)

As expected, some of the variables are always missing together, which makes perfect sense.
Example: all garage-related variables are always missing together. Reason: no garage -> no values for any garage feature.
The other group of variables that are missing together is related to the basement.
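
The "missing together" pattern can also be checked directly by counting, per row, how many variables of a group are missing. The sketch below looks for rows where only some of the garage columns are missing; it runs on a toy frame with made-up values and only two garage columns, which is an assumption made to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (made-up values).
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan],
    "GarageCars": [2, 0, 1, 0],
})

garage_cols = ["GarageType", "GarageYrBlt"]

# Number of missing garage variables per row.
n_missing = toy[garage_cols].isna().sum(axis=1)

# Rows where some, but not all, garage variables are missing
# would contradict the "always missing together" pattern.
partially_missing = toy[(n_missing > 0) & (n_missing < len(garage_cols))]
print(len(partially_missing))
```

An empty result means that every row has either all or none of the group's values.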

## Sanity checks
As a quality assessment, we check the data for duplicates and other inconsistencies.

In [None]:
# any duplicates?
df[df.duplicated(keep=False)]


In [None]:
# any house built after it was sold?
df.query('YearBuilt > YrSold')


## Cleanse data

In [None]:
# the Id column is only a record identifier, so we drop it
df = df.drop(columns=["Id"])

## Statistics

In [None]:
df.describe()

In [None]:
# we check for further correlations using different plots
NUM_FEATURES = df.select_dtypes(include=[np.number]).columns.tolist()

df_num = df[NUM_FEATURES]

# pairwise Pearson correlations between all numerical variables
df_num_corr = df_num.corr()