# Linear Regression on House Prices (1D) with Keras

This notebook shows how to fit a one-feature linear regression on house-price data using Keras.

## Define `X_train` and `Y_train`

In [None]:
import pandas as pd
import numpy as np
from numpy import transpose
import matplotlib.pyplot as plt
# %matplotlib notebook
import seaborn as sns

np.random.seed(42)
# tf.random.set_seed(42)
pd.set_option('display.max_columns', 100)

In [None]:
data = pd.read_csv('sources/train.csv')

X_train_ID = data['Id']

data = data.fillna(0)  # fillna returns a new DataFrame, so assign it back
data['Surface'] = data['GrLivArea'] + data['TotalBsmtSF']

Y_train = data.SalePrice.values.astype(float)
X_train = transpose([data.Surface.values.astype(float)])
print(f"X train {X_train.shape}")
print(f"Y train {Y_train.shape}")

## Visualize `X_train` and `Y_train`

In [None]:
sns.regplot(x = X_train, y = Y_train)

## Sequential model with Keras
Define the input layer. It has as many units as there are features in `X_train`.

Define the output layer, with 1 neuron.
`Dense` creates a _fully connected_ layer.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
# One input feature (Surface) feeding a single linear output neuron.
model = Sequential([
    Input(shape=X_train.shape[1:]),
    Dense(1)
])
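
As a rough sketch of what a single `Dense(1)` layer with no activation computes, the NumPy snippet below applies `y = x @ W + b` to a small made-up input. The arrays `x`, `W`, and `b` are illustrative values, not the weights Keras initializes or learns.

```python
import numpy as np

# Hypothetical inputs: 3 samples, 1 feature each (same layout as X_train).
x = np.array([[1.0], [2.0], [3.0]])

# A Dense(1) layer on 1 feature holds a (1, 1) kernel and a (1,) bias.
W = np.array([[2.0]])   # illustrative value
b = np.array([0.5])     # illustrative value

# With no activation, the layer output is an affine map of the input.
y = x @ W + b
print(y.shape)  # (3, 1)
```

With one feature and one output neuron, the layer has exactly two trainable parameters: the slope and the intercept of a line.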

### Scale features

Standardizing the feature (zero mean, unit variance) should help prevent gradients and weights from "exploding" during training:

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
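
To make the scaling step concrete, here is a minimal NumPy sketch of the default `StandardScaler` transform: subtract the column mean, then divide by the column standard deviation. The `surfaces` array is made-up data, not values from `train.csv`.

```python
import numpy as np

# Made-up surfaces, shaped (n_samples, 1) like X_train.
surfaces = np.array([[1500.0], [2400.0], [3100.0], [1800.0]])

# Per-column mean and population standard deviation (ddof=0),
# which is what StandardScaler uses by default.
mean = surfaces.mean(axis=0)
std = surfaces.std(axis=0)

scaled = (surfaces - mean) / std

print(scaled.mean(axis=0))  # close to 0
print(scaled.std(axis=0))   # close to 1
```

The same `mean` and `std` computed on the training data are reused later for the test data, which is why the notebook calls `scaler.transform` (and not `fit`) on `X_test`.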

## Set up optimizer and "compile" model
In Keras, the optimizer is specified when "compiling" the model.
(This is the last piece of setup before the model can be trained.)

In [None]:
from tensorflow.keras.optimizers import SGD
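loss = "mse"  # mean squared error
LEARNING_RATE = 0.001
# `lr` is a deprecated alias; current Keras versions expect `learning_rate`.
model.compile(loss=loss, optimizer=SGD(learning_rate=LEARNING_RATE))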

## Fit model (i.e. run optimization)

The model still hasn't "seen" any data yet...

* With SGD we can choose the amount of data to be used to compute the loss function (`BATCH_SIZE`). This can be useful when the whole dataset doesn't fit in memory. We'll revisit this later!
* We need to choose how many passes over the whole dataset to run (`EPOCHS`).
* The `fit` method is a loop over epochs and batches!

In [None]:
BATCH_SIZE = X_train.shape[0] # computing the loss over the whole dataset
EPOCHS = 2000 # how many iterations over the whole dataset
history = model.fit(X_train, Y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)
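
The "loop over epochs and batches" can be sketched in plain NumPy. This is a simplified illustration, not Keras's actual implementation: `fit_sketch` is a hypothetical helper, the toy data is made up, and unlike `model.fit` it does not shuffle the data between epochs.

```python
import numpy as np

def fit_sketch(X, y, epochs, batch_size, learning_rate):
    """Hypothetical helper: mini-batch gradient descent on MSE for y ~ X @ w + b."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    for _ in range(epochs):                            # outer loop: epochs
        for start in range(0, n_samples, batch_size):  # inner loop: batches
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            error = xb @ w + b - yb
            # Gradients of mean(error ** 2) with respect to w and b.
            grad_w = 2.0 * (xb.T @ error) / len(xb)
            grad_b = 2.0 * error.mean()
            w = w - learning_rate * grad_w
            b = b - learning_rate * grad_b
        losses.append(np.mean((X @ w + b - y) ** 2))   # loss after each epoch
    return w, b, losses

# Toy data lying exactly on the line y = 3x + 1 (made up for illustration).
X_toy = np.array([[-1.0], [0.0], [1.0], [2.0]])
y_toy = 3.0 * X_toy[:, 0] + 1.0

# batch_size equal to the dataset size gives exactly one update per epoch.
w, b, losses = fit_sketch(X_toy, y_toy, epochs=500,
                          batch_size=len(X_toy), learning_rate=0.05)
print(w, b, losses[-1])
```

Choosing a smaller `batch_size` would produce several weight updates per epoch, each computed from only part of the data.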

## Review learning curve

In [None]:
hist = pd.DataFrame(history.history)

In [None]:
sns.relplot(x=hist.index, y="loss", kind="line", data=hist)

## Visualize model

* The model is a line defined by coefficient `W` and bias (a.k.a. intercept) `b`
* We just need to plot 2 points and link them...
    * x-axis: let's choose minimum and maximum of `X_train`
    * y-axis: given by model's predictions

In [None]:
x_line = np.transpose([[X_train.min(), X_train.max()]])
y_line = model.predict(x_line)

In [None]:
# y_line has shape (2, 1), so index twice to print a scalar rather than an array.
print(f"Point 1: [{x_line[0][0]}, {y_line[0][0]}]")
print(f"Point 2: [{x_line[1][0]}, {y_line[1][0]}]")

In [None]:
plt.plot(X_train, Y_train, "b.")
plt.plot(x_line, y_line, "r-") # "r-" means we plot data points in red and link them with a line
plt.show()
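
Because `X_train` was standardized, the fitted line is expressed in scaled units. As a sketch, the slope and intercept can be converted back to the original feature units with a little algebra. The numbers below are hypothetical placeholders; in this notebook they would plausibly come from `model.get_weights()` and from the fitted scaler's `mean_` and `scale_` attributes.

```python
import numpy as np

# Hypothetical values for illustration only.
W, b = 80000.0, 180000.0      # slope and bias learned on scaled inputs
mu, sigma = 2500.0, 800.0     # mean and standard deviation of the raw feature

# Since x_scaled = (x_raw - mu) / sigma, the prediction W * x_scaled + b
# can be rewritten as slope_raw * x_raw + intercept_raw.
slope_raw = W / sigma
intercept_raw = b - W * mu / sigma

# Both routes give the same predictions for raw inputs.
x_raw = np.array([1000.0, 2500.0, 4000.0])
pred_via_scaling = W * (x_raw - mu) / sigma + b
pred_via_raw_line = slope_raw * x_raw + intercept_raw
print(slope_raw, intercept_raw)
```

Here `slope_raw` reads as the predicted price change per unit of surface, which is easier to interpret than the slope per standard deviation.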

In [None]:
data_test = pd.read_csv('sources/test.csv')
data_test.fillna(0, inplace=True)
data_test['Surface'] = data_test['TotalBsmtSF'] + data_test['GrLivArea']

In [None]:
X_test = transpose([data_test.Surface.values.astype(float)])
print(f"X test {X_test.shape}")

In [None]:
X_test = scaler.transform(X_test) # Apply the scaler fitted on the training surfaces to the test surfaces

In [None]:
Y_test = model.predict(x=X_test)
Y_test = Y_test.reshape(-1)  # flatten (n, 1) predictions to shape (n,)

In [None]:
data_test[660:662]

In [None]:
sns.scatterplot(x = data_test['Surface'], y = Y_test)

In [None]:
data_test['SalePrice'] = Y_test

In [None]:
data_test

In [None]:
# Keep only the two columns required for the submission file.
data_test.drop(columns=data_test.columns.difference(['Id', 'SalePrice']), inplace=True)

In [None]:
data_test

In [None]:
data_test.to_csv('storage/kaggle_submission_file.csv', index=False)