# Part 9: Complete DNN code


**Complete code of DNN regressors**

Before summarising the results, we first give the code in its entirety, with the tunable parameters collected in one place so that the different components are easy to alter. We make some slight changes to the code from part 8 for clarity. We also now include the option of outlier compensation, as we did in part 7 for linear regression.

We will then experiment with different parameters. Note that the predictions involve a certain amount of randomness (for example, dropout removes nodes randomly), so running with the same parameters a second time will not necessarily give the same score, although the two scores should be similar.
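
To see why repeated runs differ, here is a minimal sketch of dropout-style random masking in plain Python. It is not the TensorFlow implementation; the function name `dropout_mask` and the seeds are illustrative assumptions. Fixing a seed makes the mask repeatable, while a different seed gives a different set of dropped nodes.

```python
import random

def dropout_mask(n_nodes, drop, seed=None):
    """Return a list of 0/1 flags; each node is dropped (0) with probability `drop`."""
    rng = random.Random(seed)  # seed=None draws from system entropy, so runs differ
    return [0 if rng.random() < drop else 1 for _ in range(n_nodes)]

# Same seed: identical masks, so the run is repeatable
mask_a = dropout_mask(1200, 0.35, seed=0)
mask_b = dropout_mask(1200, 0.35, seed=0)

# Different seed: a different set of nodes is dropped
mask_c = dropout_mask(1200, 0.35, seed=1)

# Fraction of nodes kept, which should be close to 1 - drop = 0.65
kept_fraction = sum(mask_a) / len(mask_a)
```

The regressor below is trained without a fixed seed, which is one reason its scores vary slightly between runs.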

In [16]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot as plt
import tensorflow as tf

# Import train and test data, save train SalePrice and test Id separately, remove train SalePrice and train and test Id
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
train.drop(['Id'], axis=1, inplace=True)
submission = test[['Id']].copy() # Copy, so that adding the SalePrice column later does not write to a slice of test
test.drop(['Id'], axis=1, inplace=True)
train_sale_price = train[['SalePrice']].copy() # Copy, so that the log transformation later does not write to a slice of train
train.drop(['SalePrice'], axis=1, inplace=True)

# Split features/columns into numerical and categorical lists
numeric_features_columns = list(train.select_dtypes(include=[np.number]).columns)
categorical_features_list_to_remove_from_numerical = ['MSSubClass', 'MoSold']
numeric_features_columns = list(set(numeric_features_columns) - set(categorical_features_list_to_remove_from_numerical))
train['MSSubClass'] = train['MSSubClass'].apply(str)
test['MSSubClass'] = test['MSSubClass'].apply(str)
train['MoSold'] = train['MoSold'].apply(str)
test['MoSold'] = test['MoSold'].apply(str)
categorical_features_columns = list(set(train.columns) - set(numeric_features_columns))

# Data processing parameters we can change
categorical_parameter = 0.95 # Remove all categorical features which are dominated by a single category. More exactly, remove those for which number_of_largest_category/total_number_of_house_data_points is > categorical_parameter and <= 1. If you want nothing removed, take > 1.
weak_correlations_paramenter = 0.02 # Remove all features whose train column is weakly correlated with train SalePrice. More exactly, remove those for which the correlation is > -weak_correlations_paramenter and < +weak_correlations_paramenter. If you want nothing removed, take < 0.
strong_correlations_paramenter = 0.82 # Remove 1 feature from every pair of features whose train columns are strongly correlated. More exactly, remove 1 feature from every pair for which the correlation is > +strong_correlations_paramenter and <= +1, or >= -1 and < -strong_correlations_paramenter. If you want nothing removed, take > 1.
outlier_salePrice_upper_bound = 2 # Remove train rows/house-data for which the scaled train SalePrice is >= outlier_salePrice_upper_bound and <= 1. If you want nothing removed, take > 1.
outlier_salePrice_lower_bound = -1 # Remove train rows/house-data for which the scaled train SalePrice is >= 0 and <= outlier_salePrice_lower_bound. If you want nothing removed, take < 0.
far_from_typical_outliers_with_bounds = [
#     ('LotArea', 0.98),
#     ('LotFrontage', 0.98),
#     ('MasVnrArea', 0.98),
#     ('BsmtFinSF1', 0.98),
#     ('TotalBsmtSF', 0.98),
#     ('2ndFlrSF', 0.98),
#     ('1stFlrSF', 0.98),
#     ('GrLivArea', 0.98),
#     ('BsmtFullBath', .98),
#     ('TotRmsAbvGrd', 1),
#     ('GarageArea', 0.98),
#     ('OpenPorchSF', 0.98),
#     ('MiscVal', 0.98)
] # Remove, for example, train house data for which the scaled train LotArea is >= 0.98 and <= 1. Similarly for the other features in the list. If you want nothing removed, leave the list empty. If you want more removed, add features to the list.

# DNNRegressor parameters we can change, introduced in part 8
nodes = [1200]
activation_function = tf.nn.relu
drop = 0.35
steps = 10000

# Fill in the missing data
latest_year_house_sold = train['YrSold'].max()
train['GarageYrBlt'].fillna(latest_year_house_sold, inplace = True)
test['GarageYrBlt'].fillna(latest_year_house_sold, inplace = True)
train_numeric_features_with_missing_values = [feature for feature in numeric_features_columns if train[feature].isnull().sum() > 0]
test_numeric_features_with_missing_values = [feature for feature in numeric_features_columns if test[feature].isnull().sum() > 0]
train_categoric_features_with_missing_values = [feature for feature in categorical_features_columns if train[feature].isnull().sum() > 0]
test_categoric_features_with_missing_values = [feature for feature in categorical_features_columns if test[feature].isnull().sum() > 0]

for feature in train_numeric_features_with_missing_values:
    train[feature].fillna(0, inplace = True)

for feature in test_numeric_features_with_missing_values:
    test[feature].fillna(0, inplace = True)

categoric_features_with_NA = [
    'Alley',
    'BsmtCond',
    'BsmtExposure',
    'BsmtFinType1',
    'BsmtFinType2',
    'BsmtQual',
    'Fence',
    'FireplaceQu',
    'GarageCond',
    'GarageFinish',
    'GarageQual',
    'GarageType',
    'MiscFeature',
    'PoolQC'
]
for feature in train_categoric_features_with_missing_values:
    if feature in categoric_features_with_NA:
        train[feature].fillna('NA', inplace = True)
    else:
        train[feature].fillna(train[feature].value_counts().idxmax(), inplace = True)
        
for feature in test_categoric_features_with_missing_values:
    if feature in categoric_features_with_NA:
        test[feature].fillna('NA', inplace = True)
    else:
        test[feature].fillna(test[feature].value_counts().idxmax(), inplace = True)

# Convert year features to age features
year_features = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
train['AgeOfHouse'] = train['YrSold'] - train['YearBuilt']
train['AgeOfRemodAdd'] = train['YrSold'] - train['YearRemodAdd']
train['AgeOfGarage'] = train['YrSold'] - train['GarageYrBlt']
train['AgeOfSale'] = latest_year_house_sold - train['YrSold']
test['AgeOfHouse'] = test['YrSold'] - test['YearBuilt']
test['AgeOfRemodAdd'] = test['YrSold'] - test['YearRemodAdd']
test['AgeOfGarage'] = test['YrSold'] - test['GarageYrBlt']
test['AgeOfSale'] = latest_year_house_sold - test['YrSold']
age_features = ['AgeOfHouse', 'AgeOfRemodAdd', 'AgeOfGarage', 'AgeOfSale']
train.drop(year_features, axis=1, inplace=True)
test.drop(year_features, axis=1, inplace=True)
numeric_features_columns = list(set(numeric_features_columns) - set(year_features))
numeric_features_columns = list(set(numeric_features_columns).union(set(age_features)))

print('After basic data preprocessing:')
print('Number of numeric features = ' + str(len(numeric_features_columns)))
print('Number of categorical features = ' + str(len(categorical_features_columns)))
print('')

# Define categorical features to transform to numerical, and the chosen numerical transformations. If you want nothing transformed, leave list blank. If you want more transformed, add features to the list, and a numerical transformation for that feature below.
categorical_features_to_convert_to_numerical = [
#     'Alley',
#     'BsmtCond',
#     'BsmtExposure',
#     'BsmtFinType1',
#     'BsmtFinType2',
#     'BsmtQual',
#     'CentralAir',
#     'ExterCond',
#     'ExterQual',
#     'Fence',
#     'FireplaceQu',
#     'Functional',
#     'GarageCond',
#     'GarageFinish',
#     'GarageQual',
#     'HeatingQC',
#     'KitchenQual',
#     'LandSlope',
#     'PavedDrive',
#     'PoolQC',
#     'Street',
#     'Utilities'
]

def numerical_transformations(feature, x):
    if feature in ['ExterQual', 'ExterCond', 'HeatingQC', 'KitchenQual', 'PoolQC']:
        if x == 'Ex':
            return 4
        elif x == 'Gd':
            return 3
        elif x == 'TA':
            return 2
        elif x == 'Fa':
            return 1
        else:
            return 0

    if feature in ['BsmtCond', 'FireplaceQu', 'GarageQual', 'GarageCond']: 
        if x == 'Ex':
            return 5
        elif x == 'Gd':
            return 4
        elif x == 'TA':
            return 3
        elif x == 'Fa':
            return 2
        elif x == 'Po':
            return 1
        else:
            return 0

    if feature in ['BsmtFinType1', 'BsmtFinType2']:
        if x == 'GLQ':
            return 6
        elif x == 'ALQ':
            return 5
        elif x == 'BLQ':
            return 4
        elif x == 'Rec':
            return 3
        elif x == 'LwQ':
            return 2
        elif x == 'Unf':
            return 1
        else:
            return 0

    if feature == 'BsmtExposure':
        if x == 'Gd':
            return 4
        elif x == 'Av':
            return 3
        elif x == 'Mn':
            return 2
        elif x == 'No':
            return 1
        else:
            return 0

    if feature == 'Functional':
        if x == 'Typ':
            return 7
        elif x == 'Min1':
            return 6
        elif x == 'Min2':
            return 5
        elif x == 'Mod':
            return 4
        elif x == 'Maj1':
            return 3
        elif x == 'Maj2':
            return 2
        elif x == 'Sev':
            return 1
        else:
            return 0

    if feature == 'GarageFinish':
        if x == 'Fin':
            return 3
        elif x == 'RFn':
            return 2
        elif x == 'Unf':
            return 1
        else:
            return 0

    if feature == 'Fence':
        if x == 'GdPrv':
            return 4
        elif x == 'MnPrv':
            return 3
        elif x == 'GdWo':
            return 2
        elif x == 'MnWw':
            return 1
        else:
            return 0

    if feature == 'BsmtQual':
        if x == 'Ex':
            return 105
        elif x == 'Gd':
            return 95
        elif x == 'TA':
            return 85
        elif x == 'Fa':
            return 75
        elif x == 'Po':
            return 65
        else:
            return 0
        
    if feature == 'CentralAir':
        if x == 'Y':
            return 1
        else:
            return 0
        
    if feature == 'Street':
        if x == 'Pave':
            return 1
        else:
            return 0

    if feature == 'Alley':
        if x == 'Grvl' or x == 'Pave':
            return 1
        else:
            return 0

    if feature == 'LandSlope':
        if x == 'Gtl':
            return 1
        else:
            return 0
        
    if feature == 'PavedDrive':
        if x == 'Y' or x == 'P':
            return 1
        else:
            return 0
    
    if feature == 'Utilities':
        if x == 'AllPub':
            return 1
        else:
            return 0

# Transform the chosen categorical features using the numerical chosen transformations
for feature in categorical_features_to_convert_to_numerical:
    train[feature] = train[feature].apply(lambda x: numerical_transformations(feature, x))
    test[feature] = test[feature].apply(lambda x: numerical_transformations(feature, x))
    categorical_features_columns.remove(feature)
    numeric_features_columns.append(feature)
    
print('After numerical transformations of some categorical features:')
print('Number of numeric features = ' + str(len(numeric_features_columns)))
print('Number of categorical features = ' + str(len(categorical_features_columns)))
print('')

# Remove categorical features dominated by a single category (value_counts() sorts in descending order, so .iloc[0] is the count of the largest category)
train_categorical_features_with_few_unique_entries = [feature for feature in categorical_features_columns if train[feature].value_counts().iloc[0]/len(train.index) > categorical_parameter]
test_categorical_features_with_few_unique_entries = [feature for feature in categorical_features_columns if test[feature].value_counts().iloc[0]/len(test.index) > categorical_parameter]
categorical_features_list_with_too_few_unique_entries = list(set(train_categorical_features_with_few_unique_entries).union(set(test_categorical_features_with_few_unique_entries)))
categorical_features_columns = list(set(categorical_features_columns) - set(categorical_features_list_with_too_few_unique_entries))
train.drop(categorical_features_list_with_too_few_unique_entries, axis=1, inplace=True)
test.drop(categorical_features_list_with_too_few_unique_entries, axis=1, inplace=True)

print('After removing categorical features with little information:')
print('Number of numeric features = ' + str(len(numeric_features_columns)))
print('Number of categorical features = ' + str(len(categorical_features_columns)))
print('')

# Log transformation of train SalePrice
train_sale_price['SalePrice'] = train_sale_price['SalePrice'].apply(np.log)

# Remove those features whose train column is weakly correlated with train SalePrice
train_correlations = pd.concat([train[numeric_features_columns], train_sale_price['SalePrice']], axis=1).corr() # We restrict our attention to the numeric features

features_with_low_correlation_to_sale_price = [feature for feature in numeric_features_columns if abs(train_correlations[feature]['SalePrice']) < weak_correlations_paramenter] # All train features for which the correlation with train SalePrice is strictly between -weak_correlations_paramenter and +weak_correlations_paramenter.
numeric_features_columns = list(set(numeric_features_columns) - set(features_with_low_correlation_to_sale_price))
train.drop(features_with_low_correlation_to_sale_price, axis=1, inplace=True)
test.drop(features_with_low_correlation_to_sale_price, axis=1, inplace=True)

print('After removing features whose train column is weakly correlated with train SalePrice:')
print('Number of numeric features = ' + str(len(numeric_features_columns)))
print('Number of categorical features = ' + str(train.shape[1]-len(numeric_features_columns)))
print('')

# Remove 1 feature from every pair of features whose train columns are strongly correlated
train_correlations = pd.concat([train[numeric_features_columns], train_sale_price['SalePrice']], axis=1).corr() # We restrict our attention to the numeric features

feature_pairs_with_strong_correlation = []
features_from_each_strongly_correlated_pair_more_weakly_correlated_with_SalePrice = []

for feature1 in numeric_features_columns:
    for feature2 in numeric_features_columns[numeric_features_columns.index(feature1)+1:]:
        if train_correlations[feature1][feature2] > +strong_correlations_paramenter or train_correlations[feature1][feature2] < -strong_correlations_paramenter:
            feature_pairs_with_strong_correlation.append([feature1, feature2])            
            if abs(train_correlations[feature1]['SalePrice']) > abs(train_correlations[feature2]['SalePrice']): # Compare absolute values, so that a strongly negative correlation counts as strong
                features_from_each_strongly_correlated_pair_more_weakly_correlated_with_SalePrice.append(feature2)
            else:
                features_from_each_strongly_correlated_pair_more_weakly_correlated_with_SalePrice.append(feature1)


strongly_correlated_features_to_remove = list(set(features_from_each_strongly_correlated_pair_more_weakly_correlated_with_SalePrice))
numeric_features_columns = list(set(numeric_features_columns) - set(strongly_correlated_features_to_remove))
train.drop(strongly_correlated_features_to_remove, axis=1, inplace=True)
test.drop(strongly_correlated_features_to_remove, axis=1, inplace=True)

print('After removing 1 feature from every pair of features whose train columns are strongly correlated:')
print('Number of numeric features = ' + str(len(numeric_features_columns)))
print('Number of categorical features = ' + str(train.shape[1]-len(numeric_features_columns)))
print('')

# Scale the train and test numeric features
numeric_train_feature_array = np.array(train[numeric_features_columns])
train_feature_scaler = MinMaxScaler()
train_feature_scaler.fit(numeric_train_feature_array)
train[numeric_features_columns] = pd.DataFrame(train_feature_scaler.transform(numeric_train_feature_array), columns = numeric_features_columns)
train_saleprice_array = np.array(train_sale_price)
train_salePrice_scaler = MinMaxScaler()
train_salePrice_scaler.fit(train_saleprice_array)
train_sale_price['SalePrice'] = pd.DataFrame(train_salePrice_scaler.transform(train_saleprice_array), columns = ['SalePrice'])
numeric_test_feature_array = np.array(test[numeric_features_columns])
test[numeric_features_columns] = pd.DataFrame(train_feature_scaler.transform(numeric_test_feature_array), columns = numeric_features_columns)

# The dataframes on which we will perform the regression
X = train.copy()
Y = train_sale_price.copy()

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=42, test_size=.33)
actual_scaled_array = np.array(y_test)
number_of_values_to_predict = actual_scaled_array.shape[0]

# Remove outlier rows from X and Y, and from x_train and y_train, but not x_test and y_test.
Y = Y[Y['SalePrice'] < outlier_salePrice_upper_bound]
Y = Y[Y['SalePrice'] > outlier_salePrice_lower_bound]
X = X[X.index.isin(list(Y.index))]

print('After removing train SalePrice outlier rows:')
print('Number of train rows = ' + str(X.shape[0]))
print('')

for (feature, value_bound) in far_from_typical_outliers_with_bounds:
    if feature in list(X.columns): # This check is added as some features may have been removed above
        X = X[X[feature] < value_bound]
Y = Y[Y.index.isin(list(X.index))]

x_train = x_train[x_train.index.isin(list(X.index))]
y_train = y_train[y_train.index.isin(list(X.index))]

print('After removing "far from typical" train rows:')
print('Number of train rows = ' + str(X.shape[0]))
print('')

# Initialise regressor for train-test split
def numeric_feature_column(feature):
    return tf.contrib.layers.real_valued_column(feature)
def embedded_feature_column(feature):
    categorical_column = tf.contrib.layers.sparse_column_with_hash_bucket(feature, hash_bucket_size=1000)
    return tf.contrib.layers.embedding_column(sparse_id_column=categorical_column, dimension=16,combiner="sum")

feature_cols = [
    numeric_feature_column(feature) for feature in numeric_features_columns
] + [
    embedded_feature_column(feature) for feature in categorical_features_columns
]

# regressor = tf.contrib.learn.DNNRegressor(
#     feature_columns = feature_cols,
#     hidden_units=nodes,
#     activation_fn = activation_function,
#     dropout=drop
# )

# Train regressor for the train-test split using x_train and y_train, and evaluate using x_test and y_test.
def input_fn_train(x, y):
    continuous_cols = {feature: tf.constant(x[feature].values) for feature in numeric_features_columns}
    categorical_cols = {feature: tf.SparseTensor(
        indices=[[i, 0] for i in range(x[feature].size)], values = x[feature].values, dense_shape = [x[feature].size, 1]) for feature in categorical_features_columns}
    value_col = tf.constant(y['SalePrice'].values)

    return {**continuous_cols, **categorical_cols}, value_col

# regressor.fit(input_fn = lambda: input_fn_train(x_train, y_train), steps=steps)
# print(regressor.evaluate(input_fn = lambda: input_fn_train(x_test, y_test), steps=1))

# Predict values of y_test using the trained regressor and x_test
def input_fn_predict(x):
    continuous_columns = {feature: tf.constant(x[feature].values) for feature in numeric_features_columns}
    categorical_columns = {feature: tf.SparseTensor(
        indices=[[i, 0] for i in range(x[feature].size)], values = x[feature].values, dense_shape = [x[feature].size, 1]) for feature in categorical_features_columns}

    return {**continuous_columns, **categorical_columns}

# def actual_scaled_array():
#     return np.array(y_test)
# number_of_values_to_predict = y_test.shape[0]
# def predicted_scaled_array():
#     return np.array(list(regressor.predict(input_fn=lambda: input_fn_predict(x_test)))).reshape(number_of_values_to_predict,1)

# actual_log_prices = pd.DataFrame(train_salePrice_scaler.inverse_transform(actual_scaled_array()), columns = ['SalePrice'])
# predicted_log_prices = pd.DataFrame(train_salePrice_scaler.inverse_transform(predicted_scaled_array()), columns = ['SalePrice'])

# # Score and plot the train-test split
# print('RMSE of train-test split = ' + str(np.sqrt(mean_squared_error(actual_log_prices.values, predicted_log_prices.values))))

# fig = plt.figure()
# ax = fig.gca()
# ax.set_xticks(np.arange(10, 14, .5))
# ax.set_yticks(np.arange(10, 14, .5))
# plt.xlim(10,  14)
# plt.ylim(10, 14)
# plt.plot([10,14], [10,14])
# plt.grid()
# plt.scatter(actual_log_prices, predicted_log_prices)
# plt.ylabel('Predicted Price')
# plt.xlabel('Actual Price')
# plt.title('DNN Regression')
# plt.show()

# Initialise a new regressor for the whole of X and Y using the same feature_columns, etc, as above
regressor = tf.contrib.learn.DNNRegressor(
    feature_columns = feature_cols,
    hidden_units=nodes,
    activation_fn = activation_function,
    dropout=drop
)

# Train the regressor using the whole of X and Y
regressor.fit(input_fn = lambda: input_fn_train(X, Y), steps=steps)

# The values to be predicted
X_test = test.copy()
number_of_values_to_predict = X_test.shape[0]

# Predict using the regressor and X_test
def predicted_scaled_array():
    return np.array(list(regressor.predict(input_fn=lambda: input_fn_predict(X_test)))).reshape(number_of_values_to_predict,1)
predicted_log_prices = pd.DataFrame(train_salePrice_scaler.inverse_transform(predicted_scaled_array()), columns = ['SalePrice'])
submission['SalePrice'] = predicted_log_prices['SalePrice'].apply(np.exp)

# Generate a kaggle submission file
submission.to_csv('submission_dnn_regression_numerical_and_categorical.csv',index=False)

submission

After basic data preprocessing:
Number of numeric features = 34
Number of categorical features = 45

After numerical transformations of some categorical features:
Number of numeric features = 34
Number of categorical features = 45

After removing categorical features with little information:
Number of numeric features = 34
Number of categorical features = 37

After removing features whose train column is weakly correlated with train SalePrice:
Number of numeric features = 32
Number of categorical features = 37

After removing 1 feature from every pair of features whose train columns are strongly correlated:
Number of numeric features = 30
Number of categorical features = 37

After removing train SalePrice outlier rows:
Number of train rows = 1460

After removing "far from typical" train rows:
Number of train rows = 1460

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.Cluster

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpkob1r628/model.ckpt.
INFO:tensorflow:loss = 0.33614564, step = 1
INFO:tensorflow:global_step/sec: 18.5139
INFO:tensorflow:loss = 0.004207215, step = 101 (5.402 sec)
INFO:tensorflow:global_step/sec: 32.4631
INFO:tensorflow:loss = 0.0036798366, step = 201 (3.081 sec)
INFO:tensorflow:global_step/sec: 29.0051
INFO:tensorflow:loss = 0.0028823735, step = 301 (3.448 sec)
INFO:tensorflow:global_step/sec: 33.4727
INFO:tensorflow:loss = 0.0027149245, step = 401 (2.988 sec)
INFO:tensorflow:global_step/sec: 33.2444
INFO:tensorflow:loss = 0.0026313653, step = 501 (3.007 sec)
INFO:tensorflow:global_step/sec: 32.0751
INFO:tensorflow:loss = 0.0023532393, step = 601 (3.118 sec)
INFO:tensorflow:global_step/sec: 33.3822
INFO:tensorflow:loss = 0.0023186866, step = 701 (2.996 sec)
INFO:tensor

INFO:tensorflow:global_step/sec: 22.4364
INFO:tensorflow:loss = 0.0009095396, step = 7401 (4.457 sec)
INFO:tensorflow:global_step/sec: 23.074
INFO:tensorflow:loss = 0.00085505046, step = 7501 (4.334 sec)
INFO:tensorflow:global_step/sec: 23.1345
INFO:tensorflow:loss = 0.0008667884, step = 7601 (4.323 sec)
INFO:tensorflow:global_step/sec: 23.4652
INFO:tensorflow:loss = 0.0008544641, step = 7701 (4.262 sec)
INFO:tensorflow:global_step/sec: 23.3329
INFO:tensorflow:loss = 0.0008183793, step = 7801 (4.286 sec)
INFO:tensorflow:global_step/sec: 23.8949
INFO:tensorflow:loss = 0.0008221483, step = 7901 (4.185 sec)
INFO:tensorflow:global_step/sec: 23.7097
INFO:tensorflow:loss = 0.0008365698, step = 8001 (4.218 sec)
INFO:tensorflow:global_step/sec: 24.0184
INFO:tensorflow:loss = 0.00082343287, step = 8101 (4.163 sec)
INFO:tensorflow:global_step/sec: 24.3193
INFO:tensorflow:loss = 0.0008447776, step = 8201 (4.112 sec)
INFO:tensorflow:global_step/sec: 25.9167
INFO:tensorflow:loss = 0.0008524832, ste

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpkob1r628/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.



|      | Id   | SalePrice     |
|------|------|---------------|
| 0    | 1461 | 109157.890625 |
| 1    | 1462 | 142285.515625 |
| 2    | 1463 | 181734.296875 |
| 3    | 1464 | 188440.890625 |
| 4    | 1465 | 194992.500000 |
| ...  | ...  | ...           |
| 1454 | 2915 | 87860.437500  |
| 1455 | 2916 | 81350.429688  |
| 1456 | 2917 | 161413.843750 |
| 1457 | 2918 | 110682.695312 |
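
The SalePrice values in this output are obtained by undoing two transformations in reverse order: first the min-max scaling of the log price, then the log itself. The following is a minimal sketch in plain Python that does by hand what `train_salePrice_scaler.inverse_transform` followed by `np.exp` do in the code above. The log-price bounds and the scaled predictions are made-up values for illustration, not taken from the data.

```python
import math

def inverse_min_max(scaled, lo, hi):
    """Undo min-max scaling to [0, 1], where scaled = (value - lo) / (hi - lo)."""
    return scaled * (hi - lo) + lo

# Hypothetical bounds of the log SalePrice in the training data
log_price_min = math.log(35000.0)
log_price_max = math.log(755000.0)

# Hypothetical scaled predictions, in the form the regressor returns them
scaled_predictions = [0.0, 0.5, 1.0]

# Step 1: back from the [0, 1] scale to log prices
log_prices = [inverse_min_max(s, log_price_min, log_price_max) for s in scaled_predictions]

# Step 2: back from log prices to prices
prices = [math.exp(p) for p in log_prices]
```

Because the scaling was applied to log prices, the midpoint of the scaled range maps to the geometric mean of the two bounds, not their arithmetic mean.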


Next we run the DNN regression again using only the numeric features, ignoring the categorical features.

In [None]:
# Restrict to the numeric features. Y, y_train and y_test are unchanged from above
X = X[numeric_features_columns]
x_train = x_train[numeric_features_columns]
x_test = x_test[numeric_features_columns]

# Initialise the regressor for the train test split. Note feature_cols now contains no embedded columns
feature_cols = [numeric_feature_column(feature) for feature in numeric_features_columns]

# regressor = tf.contrib.learn.DNNRegressor(
#     feature_columns = feature_cols,
#     hidden_units=nodes,
#     activation_fn = activation_function,
#     dropout=drop
# )

# Train regressor for the train-test split using x_train and y_train, and evaluate using x_test and y_test. Note the input function has no categorical_cols
def input_fn_train(x, y):
    continuous_cols = {feature: tf.constant(x[feature].values) for feature in numeric_features_columns}
    value_col = tf.constant(y['SalePrice'].values)
    return continuous_cols, value_col

# regressor.fit(input_fn = lambda: input_fn_train(x_train, y_train), steps=steps)
# print(regressor.evaluate(input_fn = lambda: input_fn_train(x_test, y_test), steps=1))

# Predict values of y_test using the trained regressor and x_test
def input_fn_predict(x):
    continuous_columns = {feature: tf.constant(x[feature].values) for feature in numeric_features_columns}
    return continuous_columns

# def actual_scaled_array():
#     return np.array(y_test)
# number_of_values_to_predict = y_test.shape[0]
# def predicted_scaled_array():
#     return np.array(list(regressor.predict(input_fn=lambda: input_fn_predict(x_test)))).reshape(number_of_values_to_predict,1)

# actual_log_prices = pd.DataFrame(train_salePrice_scaler.inverse_transform(actual_scaled_array()), columns = ['SalePrice'])
# predicted_log_prices = pd.DataFrame(train_salePrice_scaler.inverse_transform(predicted_scaled_array()), columns = ['SalePrice'])

# # Score and plot the train-test split
# print('RMSE of train-test split = ' + str(np.sqrt(mean_squared_error(actual_log_prices.values, predicted_log_prices.values))))

# fig = plt.figure()
# ax = fig.gca()
# ax.set_xticks(np.arange(10, 14, .5))
# ax.set_yticks(np.arange(10, 14, .5))
# plt.xlim(10,  14)
# plt.ylim(10, 14)
# plt.plot([10,14], [10,14])
# plt.grid()
# plt.scatter(actual_log_prices, predicted_log_prices)
# plt.ylabel('Predicted Price')
# plt.xlabel('Actual Price')
# plt.title('DNN Regression')
# plt.show()

# Initialise a new regressor for the whole of X and Y using the same feature_columns, etc, as above
regressor = tf.contrib.learn.DNNRegressor(
    feature_columns = feature_cols,
    hidden_units=nodes,
    activation_fn = activation_function,
    dropout=drop
)

# Train the regressor using the whole of X and Y
regressor.fit(input_fn = lambda: input_fn_train(X, Y), steps=steps)

# The values to be predicted
X_test = X_test[numeric_features_columns]
number_of_values_to_predict = X_test.shape[0]

# Predict using the regressor and X_test
def predicted_scaled_array():
    return np.array(list(regressor.predict(input_fn=lambda: input_fn_predict(X_test)))).reshape(number_of_values_to_predict,1)
predicted_log_prices = pd.DataFrame(train_salePrice_scaler.inverse_transform(predicted_scaled_array()), columns = ['SalePrice'])
submission['SalePrice'] = predicted_log_prices['SalePrice'].apply(np.exp)

# Generate a kaggle submission file
submission.to_csv('submission_dnn_regression_numerical_only.csv',index=False)

submission

**Experimenting with different parameters**

Let us see how good a score we can get with the above code. We restrict our attention to the regressor that uses both the numerical and the categorical features, ignoring the regressor that uses the numerical features only.

Let us begin with the basic data preprocessing ONLY and nothing else: Take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12429. Loss 0.00070969027.
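
The "take > 1" and "take < 0" settings switch a filter off because each filter is a strict comparison against a quantity with a known range. A small sketch of the two feature filters on made-up inputs (the helper names and the toy data are illustrative assumptions, not part of the code above):

```python
def dominant_fraction(values):
    """Fraction of entries taken by the most common category (always in (0, 1])."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return max(counts.values()) / len(values)

def remove_dominated(features, categorical_parameter):
    """Names of categorical features dominated by a single category."""
    return [name for name, values in features.items()
            if dominant_fraction(values) > categorical_parameter]

def remove_weak(correlations, weak_parameter):
    """Names of features whose correlation lies strictly inside (-p, +p)."""
    return [name for name, c in correlations.items() if abs(c) < weak_parameter]

# Toy data: 'Street' is 99% one category, 'Fence' is more evenly spread
toy_categorical = {
    'Street': ['Pave'] * 99 + ['Grvl'],
    'Fence': ['NA'] * 60 + ['MnPrv'] * 40,
}
toy_correlations = {'GrLivArea': 0.70, 'MiscVal': -0.02, 'Constant': 0.0}

dropped_at_095 = remove_dominated(toy_categorical, 0.95)   # only 'Street' is dropped
dropped_above_1 = remove_dominated(toy_categorical, 1.01)  # nothing can exceed 1
weak_at_003 = remove_weak(toy_correlations, 0.03)          # 'MiscVal' and 'Constant'
weak_below_0 = remove_weak(toy_correlations, -0.01)        # abs(c) is never negative
```

A fraction can never exceed 1 and an absolute correlation can never be below 0, so those two settings remove nothing.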
    
Next take categorical_parameter > 1, weak_correlations_paramenter = 0.01, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12306. Loss 0.00072850403.
    
Next take categorical_parameter > 1, weak_correlations_paramenter = 0.03, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12261. Loss 0.0007177028.

Next take categorical_parameter > 1, weak_correlations_paramenter = 0.037, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12950. Loss 0.0007350311.
    
Next take categorical_parameter > 1, weak_correlations_paramenter = 0.05, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12830. Loss 0.00074797316.
    
Next take categorical_parameter > 1, weak_correlations_paramenter = 0.1, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12890. Loss 0.00069728785.
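
To compare the runs above, where only weak_correlations_paramenter changes from 0.01 to 0.1, it can help to collect the scores in one place. The following is a minimal sketch that copies the RMSE values from those five runs and ranks them; the dictionary name `weak_correlation_runs` is just an illustrative choice and is not part of the code from this notebook.

```python
# RMSE values copied from the five runs above (all other settings identical).
weak_correlation_runs = {
    0.01: 0.12306,
    0.03: 0.12261,
    0.037: 0.12950,
    0.05: 0.12830,
    0.1: 0.12890,
}

# Sort the thresholds by RMSE, lowest (best) first.
ranked = sorted(weak_correlation_runs.items(), key=lambda item: item[1])
best_threshold, best_rmse = ranked[0]

for threshold, rmse in ranked:
    print(f"weak_correlations_paramenter = {threshold:<5}  RMSE = {rmse:.5f}")
```

Since repeated runs with the same parameters need not give the same score, small gaps in this ranking should be read with caution.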
    


Next take categorical_parameter = 0.99, weak_correlations_paramenter = 0.03, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12524. Loss 0.000734859.
    
Next take categorical_parameter = 0.98, weak_correlations_paramenter = 0.03, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12283. Loss 0.00070429454.
    
Next take categorical_parameter = 0.96, weak_correlations_paramenter = 0.03, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12436. Loss 0.00074239756.
    
Next take categorical_parameter = 0.95, weak_correlations_paramenter = 0.03, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12346. Loss 0.0007727057.
    
Next take categorical_parameter = 0.98, weak_correlations_paramenter = 0.02, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12311. Loss 0.0007495586.
    
Next take categorical_parameter = 0.96, weak_correlations_paramenter = 0.02, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12361. Loss 0.0007564796.
    
Next take categorical_parameter = 0.95, weak_correlations_paramenter = 0.02, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12164. Loss 0.0007528564.
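
The seven runs above vary categorical_parameter and weak_correlations_paramenter together, so they are easier to read as a grid. Below is a small sketch, assuming pandas is available, that arranges the RMSE values from those runs into a table; the names `runs` and `grid` are illustrative only. The combination (0.99, 0.02) was not run, so that cell stays empty (NaN).

```python
import pandas as pd

# (categorical_parameter, weak_correlations_paramenter, RMSE) copied from the runs above.
runs = pd.DataFrame(
    [
        (0.99, 0.03, 0.12524),
        (0.98, 0.03, 0.12283),
        (0.96, 0.03, 0.12436),
        (0.95, 0.03, 0.12346),
        (0.98, 0.02, 0.12311),
        (0.96, 0.02, 0.12361),
        (0.95, 0.02, 0.12164),
    ],
    columns=["categorical_parameter", "weak_correlations_paramenter", "rmse"],
)

# One row per categorical_parameter, one column per weak_correlations_paramenter.
grid = runs.pivot(
    index="categorical_parameter",
    columns="weak_correlations_paramenter",
    values="rmse",
)
print(grid)

# The single run with the lowest RMSE.
best = runs.loc[runs["rmse"].idxmin()]
print(best)
```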
    
    
 
    
Next take categorical_parameter = 0.96, weak_correlations_paramenter = 0.03, strong_correlations_paramenter = 0.85, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12378. Loss 0.0007258149.
    
Next take categorical_parameter = 0.96, weak_correlations_paramenter = 0.03, strong_correlations_paramenter = 0.82, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12329. Loss 0.000731381.
    
Next take categorical_parameter = 0.96, weak_correlations_paramenter = 0.03, strong_correlations_paramenter = 0.8, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12494. Loss 0.0008165123.
    
Next take categorical_parameter = 0.95, weak_correlations_paramenter = 0.02, strong_correlations_paramenter = 0.82, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY categorical features. Then we choose the DNN parameters:

    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12324. Loss 0.0008115017.
    
    
    
    

    

 
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound = 0.95, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12400. Loss 0.00069838477.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound = 0.9, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12407. Loss 0.00072268647. 

Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound = 0.85, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12378. Loss 0.0007233704.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound = 0.8, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12398. Loss 0.00072149455.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound = 0.025, remove no far from typical outliers. Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12409. Loss 0.0007291263.

Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound = 0.05, remove no far from typical outliers. Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12658. Loss 0.0007343705.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound = 0.1, remove no far from typical outliers. Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12693. Loss 0.0007110241.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove ALL far from typical outliers. Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.13317. Loss 0.0006934936.

Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove the following far from typical outliers: ('LotArea', 0.75), ('LotFrontage', 0.75), ('MasVnrArea', 0.75), ('BsmtFinSF1', 0.75), ('TotalBsmtSF', 0.75), ('2ndFlrSF', 0.75), ('1stFlrSF', 0.75), ('GrLivArea', 0.75), ('BsmtFullBath', .75), ('TotRmsAbvGrd', 1), ('GarageArea', 0.75), ('OpenPorchSF', 0.75), ('MiscVal', 0.75). Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.13380. Loss 0.00069774303.

Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove the following far from typical outliers: ('LotArea', 0.9), ('LotFrontage', 0.9), ('MasVnrArea', 0.9), ('BsmtFinSF1', 0.9), ('TotalBsmtSF', 0.9), ('2ndFlrSF', 0.9), ('1stFlrSF', 0.9), ('GrLivArea', 0.9), ('BsmtFullBath', .9), ('TotRmsAbvGrd', 1), ('GarageArea', 0.9), ('OpenPorchSF', 0.9), ('MiscVal', 0.9). Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12829. Loss 0.0007705575.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove the following far from typical outliers: ('LotArea', 0.98), ('LotFrontage', 0.98), ('MasVnrArea', 0.98), ('BsmtFinSF1', 0.98), ('TotalBsmtSF', 0.98), ('2ndFlrSF', 0.98), ('1stFlrSF', 0.98), ('GrLivArea', 0.98), ('BsmtFullBath', .98), ('TotRmsAbvGrd', 1), ('GarageArea', 0.98), ('OpenPorchSF', 0.98), ('MiscVal', 0.98). Do not numerically transform ANY features. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12438. Loss 0.0007011956.

Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Numerically transform ONLY the following features: 'CentralAir', 'Street'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12236. Loss 0.00079541735.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Numerically transform ONLY the following features: 'Alley', 'CentralAir', 'LandSlope', 'PavedDrive', 'Street', 'Utilities'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12232. Loss 0.0008793639.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Numerically transform ONLY the following features: 'ExterCond', 'ExterQual', 'HeatingQC', 'KitchenQual', 'PoolQC'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12514. Loss 0.0007469458.
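
As an illustration of what numerically transforming quality features such as 'ExterQual' or 'KitchenQual' could look like, here is a minimal sketch that maps ordered labels to integers. The label ordering in `quality_order`, the choice of 0 for a missing value and the helper `to_ordinal` are all assumptions for this example; the transformation used in the notebook itself may differ.

```python
import pandas as pd

# ASSUMED ordering of the quality labels, worst to best.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def to_ordinal(series, order):
    """Map labels to integers; labels not in `order` (including missing) become 0."""
    return series.map(order).fillna(0).astype(int)

# Toy data, not taken from the real training set.
toy = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex", None],
    "KitchenQual": ["Fa", "TA", "Gd", "Po"],
})

for column in ["ExterQual", "KitchenQual"]:
    toy[column] = to_ordinal(toy[column], quality_order)

print(toy)
```

A feature encoded this way becomes a single numeric column instead of one column per label, which is the trade-off the runs in this group are testing.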
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Numerically transform ONLY the following features: 'BsmtCond', 'FireplaceQu', 'GarageCond', 'GarageQual'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12292. Loss 0.00079709815.

Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Numerically transform ONLY the following features: 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12385. Loss 0.0007973538.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Numerically transform ONLY the following features: 'GarageCond', 'GarageFinish', 'GarageQual'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12378. Loss 0.0007574329.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Numerically transform ONLY the following features: 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'GarageCond', 'GarageFinish', 'GarageQual'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12445. Loss 0.000883769.
    
Next take categorical_parameter > 1, weak_correlations_paramenter < 0, strong_correlations_paramenter > 1, outlier_salePrice_upper_bound > 1, outlier_salePrice_lower_bound < 0, remove no far from typical outliers. Numerically transform ONLY the following features: 'Alley', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'CentralAir', 'FireplaceQu', 'GarageCond', 'GarageFinish', 'GarageQual', 'LandSlope', 'PavedDrive', 'Street', 'Utilities'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12614. Loss 0.0011753084.
    
Next take categorical_parameter = 0.96, weak_correlations_paramenter = 0.03, strong_correlations_paramenter = 0.8, outlier_salePrice_upper_bound = 0.8, outlier_salePrice_lower_bound = 0.025. Remove the following far from typical outliers: ('LotArea', 0.98), ('LotFrontage', 0.98), ('MasVnrArea', 0.98), ('BsmtFinSF1', 0.98), ('TotalBsmtSF', 0.98), ('2ndFlrSF', 0.98), ('1stFlrSF', 0.98), ('GrLivArea', 0.98), ('BsmtFullBath', .98), ('TotRmsAbvGrd', 1), ('GarageArea', 0.98), ('OpenPorchSF', 0.98), ('MiscVal', 0.98). Numerically transform ONLY the following features: 'Alley', 'CentralAir', 'LandSlope', 'PavedDrive', 'Street', 'Utilities'. Then we choose the DNN parameters:
    
    nodes = [1200], activation_function = tf.nn.relu, drop = 0.35, steps = 10000. RMSE 0.12835. Loss 0.00084135204.
