## House Prices: How to work offline
This is an example script for working 'offline' on the [House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition. When working online you are limited to a maximum of 5 submissions per day, which can be restrictive if you are trying out many ideas at the same time. The solution to this competition is in the public domain, so you can instead work entirely offline, either by adding the [House Prices: Advanced Regression 'solution' file](https://www.kaggle.com/carlmcbrideellis/house-prices-advanced-regression-solution-file) to your notebook using the **+ Add data** option found at the top right of your notebook, or by downloading the `solution.csv` file to your computer. This opens up the possibility of experimenting with more advanced techniques, such as pipelines that compare various estimators in the same file, or extensive hyper-parameter tuning.

Below is an example script that loads both the competition files and the solution file, fits a simple random forest regression, and then evaluates the score. Just as for the competition leaderboard, the score is the root of the [mean squared logarithmic error regression loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html), calculated using [the following equation](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-log-error):

$$ {\mathrm {RMSLE}}\,(y, \hat y) = \sqrt{ \frac{1}{n_{   \mathrm{samples}    }}  \sum_{i=0}^{n_{    \mathrm{samples} }-1} \left( \ln (1+y_i) - \ln (1+ \hat y_i) \right)^2 }  $$

where $\hat y_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value.
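
As a quick check of the equation above, here is a minimal sketch that writes out the RMSLE term by term with NumPy and compares it with the square root of scikit-learn's `mean_squared_log_error`. The helper name `rmsle` and the three demo prices are made up for illustration; they are not part of the competition data.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, written out as in the equation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(x) computes ln(1 + x)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# made-up prices, for illustration only
y_true_demo = [200000.0, 150000.0, 320000.0]
y_pred_demo = [210000.0, 140000.0, 300000.0]

manual_value  = rmsle(y_true_demo, y_pred_demo)
sklearn_value = np.sqrt(mean_squared_log_error(y_true_demo, y_pred_demo))
print("manual: %.5f, scikit-learn: %.5f" % (manual_value, sklearn_value))
```

The two values should agree to within floating-point precision, which is why the script below can rely on `mean_squared_log_error` directly.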

**Note:** The score returned is not *exactly* the same as that given when you submit to the public leaderboard. This is because only 50% of predictions from the test set are assigned to the public leaderboard. The score you see on the public leaderboard reflects your model’s accuracy on this portion of the test set, whereas here you are using 100% of the predictions. Ideally you should also perform such a split, in order to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting).
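
One way to imitate such a split offline is sketched below. It is only an approximation: which test rows Kaggle actually assigns to the public leaderboard is not known here, so the sketch assumes a random 50% selection. The function name `split_scores`, its parameters, and the synthetic prices are all hypothetical, chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def split_scores(y_true, y_pred, public_fraction=0.5, seed=0):
    """Return (public, private) RMSLE for a random split of the test rows.

    Assumption: the public part is a random subset; the real Kaggle
    assignment of rows is not known.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_public = int(public_fraction * len(y_true))
    rng = np.random.default_rng(seed)
    # boolean mask with exactly n_public True entries, in random positions
    in_public = rng.permutation(len(y_true)) < n_public
    public  = np.sqrt(mean_squared_log_error(y_true[in_public],  y_pred[in_public]))
    private = np.sqrt(mean_squared_log_error(y_true[~in_public], y_pred[~in_public]))
    return public, private

# synthetic 'true' prices and noisy 'predictions', for illustration only
demo_rng    = np.random.default_rng(42)
y_true_demo = demo_rng.uniform(50000, 500000, size=1000)
y_pred_demo = y_true_demo * demo_rng.normal(1.0, 0.1, size=1000)

public_score, private_score = split_scores(y_true_demo, y_pred_demo)
print("public: %.5f, private: %.5f" % (public_score, private_score))
```

With the real data you would pass `y_true` and `y_pred` from the script instead of the synthetic arrays, tune your model against the 'public' score only, and look at the 'private' score sparingly.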

Here is the code. Please feel totally free to make a *fork* and then replace my trivial feature engineering and estimator with your own magnificent work!

In [None]:
#!/usr/bin/python3
# coding=utf-8
#===========================================================================
# This is a minimal script to perform a regression on the kaggle 
# 'House Prices' data set.
#===========================================================================
#===========================================================================
# load up the libraries
#===========================================================================
import pandas  as pd
import numpy   as np

#===========================================================================
# read in the competition data 
#===========================================================================
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_data  = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

#===========================================================================
# also, read in the 'solution' data 
# Note: you either need to use "+ Add data" to include this file if you are working on Kaggle,
# or download it and store it locally if you are completely offline
#===========================================================================
solution   = pd.read_csv('../input/house-prices-advanced-regression-solution-file/solution.csv')
y_true     = solution["SalePrice"]
                         
#===========================================================================
# select some features of interest
#===========================================================================
features = ['OverallQual', 'GrLivArea', 'GarageCars',  'TotalBsmtSF']

#===========================================================================
# assemble the training and test data from the selected features
#===========================================================================
X_train       = train_data[features]
y_train       = train_data["SalePrice"]
final_X_test  = test_data[features]

#===========================================================================
# essential preprocessing: imputation; substitute any 'NaN' with mean value
#===========================================================================
X_train      = X_train.fillna(X_train.mean())
final_X_test = final_X_test.fillna(final_X_test.mean())

#===========================================================================
# set up the random forest regressor and then fit it to the training data
#===========================================================================
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, max_depth=7)
regressor.fit(X_train, y_train)

#===========================================================================
# use the model to predict the prices for the test data
#===========================================================================
y_pred = regressor.predict(final_X_test)

#===========================================================================
# compare your predictions with the 'solution' using the 
# root of the mean_squared_log_error
#===========================================================================
from sklearn.metrics import mean_squared_log_error
RMSLE = np.sqrt( mean_squared_log_error(y_true, y_pred) )
print("The score is %.5f" % RMSLE )
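
Because scoring offline costs nothing, several estimators can be compared in a single run. The following is a minimal sketch of that idea under stated assumptions: it uses synthetic data so that it runs on its own, and the names `X_demo`, `y_demo`, `candidates` and `scores`, as well as the particular choice of estimators, are illustrative rather than a recommendation. With the competition data you would fit on `X_train`, `y_train` and score against `y_true` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error

# synthetic stand-in for the competition data (4 features, positive 'prices')
rng    = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 4))
y_demo = 50000 + 20000 * X_demo.sum(axis=1) + rng.normal(0, 5000, size=300)

# the first 200 rows play the role of train.csv, the last 100 that of test.csv
X_fit,  y_fit  = X_demo[:200], y_demo[:200]
X_hold, y_hold = X_demo[200:], y_demo[200:]

candidates = {
    "random forest":     RandomForestRegressor(n_estimators=20, max_depth=7, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
    "ridge":             Ridge(alpha=1.0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_fit, y_fit)
    # mean_squared_log_error rejects negative values, so clip predictions at 0
    predictions  = np.clip(model.predict(X_hold), 0, None)
    scores[name] = np.sqrt(mean_squared_log_error(y_hold, predictions))

# print the candidates from best (lowest RMSLE) to worst
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print("%-18s %.5f" % (name, score))
```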

When you are finally ready to submit your work to the leaderboard, you can produce a `submission.csv` with the following code:

In [None]:
#===========================================================================
# write out CSV submission file
#===========================================================================
output = pd.DataFrame({"Id":test_data.Id, "SalePrice":y_pred})
output.to_csv('submission.csv', index=False)