<h1> Exploring the Zillow Prize dataset </h1>

In this notebook we explore the provided datasets to better understand the problem at hand. Along the way we try to draw preliminary conclusions about the distribution of the data that can inform the modelling phase.

Let's start by loading the training dataset for inspection:

In [None]:
import pandas as pd

train_features = pd.read_csv('data/train_features.csv')

<h3> Renaming the columns </h3>

The original column names are hard to interpret, so we will rename them using a custom mapping. Since
this belongs to the data layer rather than the analysis, we read the mapping from a file. The expected format
is one key-value pair per line, separated by the "=" symbol.

For example: 

`newColumnName = oldColumnName`

In [None]:
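col_mapping = pd.read_csv('data/feature_names', sep='=', header=None)
# Strip whitespace around "=" (DataFrame.applymap is deprecated in recent pandas)
col_mapping = col_mapping.apply(lambda col: col.str.strip())
# rename() expects {old_name: new_name}; the file stores "new = old"
mapping_dict = dict(zip(col_mapping[1], col_mapping[0]))
train_features.rename(columns=mapping_dict, inplace=True)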

num_records, num_features = train_features.shape
print('There are {0} properties recorded and {1} features in total'.format(num_records, num_features))
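
To make the expected file format concrete, here is a minimal sketch that parses an in-memory mapping instead of the real file. The mapping lines and the toy column `bedroomcnt` are assumptions made up for illustration; only the "new = old" layout comes from the description above.

```python
import io

import pandas as pd

# Hypothetical mapping file contents: one "newColumnName = oldColumnName" per line.
mapping_text = "ID = parcelid\nnum_bedrooms = bedroomcnt\n"

pairs = pd.read_csv(io.StringIO(mapping_text), sep="=", header=None,
                    names=["new", "old"])
# Remove the spaces that surround the "=" separator in both columns.
pairs = pairs.apply(lambda col: col.str.strip())

# DataFrame.rename expects a dict of the form {old_name: new_name}.
old_to_new = dict(zip(pairs["old"], pairs["new"]))

# Toy frame standing in for train_features.
toy = pd.DataFrame({"parcelid": [1, 2], "bedroomcnt": [3, 4]})
toy = toy.rename(columns=old_to_new)
print(list(toy.columns))
```

Note that the dictionary is built from the second column to the first, because the file lists the new name on the left.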

<h3> Handling sparse columns </h3>

There are quite a lot of features, but some of them have many missing values. Let's quantify that statement:

In [None]:
# isnull().mean() gives the fraction (not the count) of missing values per column
nan_fraction = train_features.isnull().mean()
nan_fraction[nan_fraction > 0.95].sort_values(ascending=False)
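
The share of sparse columns can also be computed directly rather than read off the listing. The sketch below does this on a toy frame with made-up column names, using the same 0.95 cut-off; it is an illustration of the idea, not a result on the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_features; all column names are hypothetical.
toy = pd.DataFrame({
    "dense": np.arange(100, dtype=float),
    "half_missing": [np.nan if i % 2 else 1.0 for i in range(100)],
    "sparse": [1.0] + [np.nan] * 99,
})

threshold = 0.95  # same cut-off as in the cell above

# Fraction of missing values per column.
nan_fraction = toy.isnull().mean()

# Columns whose missing fraction exceeds the threshold.
sparse_cols = nan_fraction[nan_fraction > threshold].index.tolist()

# Share of all columns that count as sparse.
sparse_share = len(sparse_cols) / toy.shape[1]
print(sparse_cols, round(sparse_share, 2))
```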

<h4> More than 1/3 of our features are sparse! </h4>

Several of these sparse features could be important, such as tax delinquency. In others, however, a missing value can easily be imputed. For example, missing values for the pool type and area probably mean that the property does not have a pool.
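
As a rough sketch of such an imputation, assuming that a missing pool area really does mean "no pool": the column names `pool_area` and `has_pool` are hypothetical and may not match the renamed features.

```python
import numpy as np
import pandas as pd

# Toy data; 'pool_area' is a hypothetical column name.
toy = pd.DataFrame({"pool_area": [np.nan, 450.0, np.nan, 300.0]})

# Record whether a value was present before imputing, so that the
# "missing" signal is not lost.
toy["has_pool"] = toy["pool_area"].notnull().astype(int)

# Assumption: a missing area means the property has no pool, i.e. area 0.
toy["pool_area"] = toy["pool_area"].fillna(0.0)
print(toy)
```

Keeping the indicator column lets a model distinguish "no pool" from a genuinely small pool, should the assumption turn out to be wrong for some records.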

<i>These missing values may or may not pose a significant problem in the modelling phase.</i>

<h3> Checking the target variable </h3>

Let's now take a look at our target variable: <b>the log error</b>

In [None]:
labels = pd.read_csv('data/train_label.csv')

# Rename the key column so that it matches the 'ID' column of the feature table for the join
labels.rename(columns={'parcelid': 'ID'}, inplace=True)

<h3> A lot of missing labels </h3>

It seems that most of the properties in our feature dataset do not have a label associated with them.
This is probably because a prediction error can only be computed when the real selling price is recorded, that is, when the property is actually sold, and not every property was sold within the data collection period. This significantly reduces our training set, since for the modelling phase we keep only the records whose label is known.

In [None]:
# Inner join: keep only the properties that have a label
merged = train_features.merge(labels, on='ID', how='inner')
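
To put a number on the missing labels, one option is a left join with pandas' `indicator=True` flag, which marks the rows that found no match. The sketch below uses toy frames; the column names `area` and `logerror` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the feature and label tables.
features = pd.DataFrame({"ID": [1, 2, 3, 4], "area": [100, 120, 90, 150]})
labels = pd.DataFrame({"ID": [2, 4], "logerror": [0.01, -0.03]})

# A left join keeps every property; indicator=True adds a '_merge' column
# that reads 'left_only' for properties without a matching label.
checked = features.merge(labels, on="ID", how="left", indicator=True)
unlabelled_share = (checked["_merge"] == "left_only").mean()

# An inner join keeps only the labelled properties, as in the cell above.
merged = features.merge(labels, on="ID", how="inner")
print(unlabelled_share, len(merged))
```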