### Dotscience settings

Here we tell Dotscience which dots are used for input and output, and set labels that can be queried later.

In [None]:
print('DOTSCIENCE_INPUTS=["agent1", "agent2"]')
print('DOTSCIENCE_OUTPUTS=["combined-houses"]')
print('DOTSCIENCE_LABELS={"operation": "data_wrangling", "ignore": "1"}')
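
The settings above are hand-written strings whose values are JSON. As a minimal sketch, the same lines could be generated with `json.dumps`, so the quoting stays correct if the values change. The helper name `dotscience_line` is hypothetical and not part of any Dotscience API; it only formats the `KEY=value` text that the cell above prints.

```python
import json

def dotscience_line(key, value):
    # Hypothetical helper: build one KEY=<json> line in the same shape as the prints above.
    return key + '=' + json.dumps(value)

settings_lines = [
    dotscience_line('DOTSCIENCE_INPUTS', ["agent1", "agent2"]),
    dotscience_line('DOTSCIENCE_OUTPUTS', ["combined-houses"]),
    dotscience_line('DOTSCIENCE_LABELS', {"operation": "data_wrangling", "ignore": "1"}),
]

for line in settings_lines:
    print(line)
```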

## Data wrangling - combining house price data from two real estate agents

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

We use a housing price dataset sourced from the Bay Area Home Sales Database and Zillow. It covers homes sold between January 2013 and December 2015.

In [None]:
inputs = [pd.read_csv('./agent1/bay_area_zillow_agent1.csv'), pd.read_csv('./agent2/bay_area_zillow_agent2.csv')]
df = pd.concat(inputs)
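
`pd.concat` aligns columns by name, so a column that exists in only one file would show up in the result filled with NaN for the other file's rows. A rough sketch of a sanity check before concatenating is shown below. It uses two toy frames with made-up column names, not the real agent files.

```python
import pandas as pd

# Toy stand-ins for the two agents' files; these column names are assumptions.
agent1 = pd.DataFrame({'address': ['1 A St'], 'price': [500000]})
agent2 = pd.DataFrame({'address': ['2 B St'], 'price': [750000]})
frames = [agent1, agent2]

# Every frame should have the same columns, in the same order, as the first one.
columns_match = all(list(f.columns) == list(frames[0].columns) for f in frames)
print(columns_match)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
```

If the check fails, it is probably worth inspecting the differing column names before going any further.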

In [None]:
# shuffle the DataFrame rows to remove any ordering in the data
df = df.sample(frac=1, random_state=1).reset_index(drop=True)

In [None]:
df.describe()

In [None]:
# drop unneeded columns
df.drop(df.columns[[0, 2, 3, 15, 17, 18]], axis=1, inplace=True)
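
Dropping by position depends on the column order of the input files. One way to make the step easier to review is to resolve the positions to names first and print them before dropping. The sketch below does this on a toy frame; its column names and positions are invented for illustration and do not reflect the real dataset.

```python
import pandas as pd

# Toy frame with made-up column names, standing in for the real data.
toy = pd.DataFrame({
    'id': [1, 2],
    'address': ['1 A St', '2 B St'],
    'price': [500000, 750000],
    'notes': ['', ''],
})

positions = [0, 3]
names_to_drop = list(toy.columns[positions])  # resolve positions to names first
print(names_to_drop)

trimmed = toy.drop(columns=names_to_drop)
```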

In [None]:
# TODO: add some NaN lines to one of the CSVs, spot them in the Python, and then delete them

In [None]:
# check none of our data is null or NaN
df.isnull().any()
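
The TODO above describes spotting NaN rows and deleting them. As a sketch of how that might look, here is a toy frame with deliberately missing values; the column names and values are invented. Whether dropping rows is the right treatment for the real data is a separate decision.

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows.
toy = pd.DataFrame({
    'price': [500000, np.nan, 750000, 620000],
    'bedrooms': [3, 2, np.nan, 4],
})

nulls_per_column = toy.isnull().sum()            # missing values in each column
rows_with_nulls = toy[toy.isnull().any(axis=1)]  # rows with at least one missing value
print(nulls_per_column)
print(rows_with_nulls)

# Drop the incomplete rows and renumber the index.
cleaned = toy.dropna().reset_index(drop=True)
```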

The data looks clean, so let's save it to `combined-houses` for our data science team to use!

In [None]:
df.to_csv('./combined-houses/combined-filtered-housing-data.csv')

---
### Dotscience parameters & summary

For now, we just write a dummy summary statistic to trigger Dotscience to version the data wrangling job.

In [None]:
len(df)

In [None]:
import json
print('DOTSCIENCE_PARAMETERS=' + json.dumps({}))
print('DOTSCIENCE_SUMMARY=' + json.dumps({"rows_processed": len(df)}))