# Keras Regression Project 

## The Data

We will be using data from a Kaggle data set:

https://www.kaggle.com/harlfoxem/housesalesprediction

It is historical housing data for King County, USA.

#### Feature Columns
    
* id - Unique ID for each home sold
* date - Date of the home sale
* price - Price of each home sold
* bedrooms - Number of bedrooms
* bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
* sqft_living - Square footage of the home's interior living space
* sqft_lot - Square footage of the land space
* floors - Number of floors
* waterfront - A dummy variable for whether the home overlooks the waterfront
* view - An index from 0 to 4 of how good the view of the property was
* condition - An index from 1 to 5 on the condition of the home
* grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
* sqft_above - The square footage of the interior housing space that is above ground level
* sqft_basement - The square footage of the interior housing space that is below ground level
* yr_built - The year the house was initially built
* yr_renovated - The year of the house’s last renovation
* zipcode - What zipcode area the house is in
* lat - Latitude
* long - Longitude
* sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
* sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')

## Exploratory Data Analysis

In [None]:
df.isnull().sum()

In [None]:
df.describe().transpose()

In [None]:
plt.figure(figsize=(10,6))
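# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df['price'], kde=True)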

Most of our houses fall between 0 and 1.5 million dollars. It probably makes sense to drop our extreme outliers.

In [None]:
sns.countplot(x='bedrooms', data=df)

The majority of houses have 2-5 bedrooms.

In [None]:
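# numeric_only=True skips the string 'date' column, which pandas 2.x would otherwise reject
df.corr(numeric_only=True)['price'].sort_values()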

`sqft_living` has a high correlation with price.
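
As a rough illustration of what `df.corr()` reports, the sketch below computes the Pearson correlation by hand and compares it with pandas. The five rows in `toy` are made up for illustration and are not taken from the King County data.

```python
import numpy as np
import pandas as pd

# Invented toy values, only meant to show the calculation
toy = pd.DataFrame({
    'sqft_living': [1000, 1500, 2000, 2500, 3000],
    'price': [200000, 310000, 390000, 520000, 600000],
})

# Center both columns on their means
x = toy['sqft_living'] - toy['sqft_living'].mean()
y = toy['price'] - toy['price'].mean()

# Pearson r: covariance divided by the product of the standard deviations
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy['sqft_living'].corr(toy['price'])

print(r_manual, r_pandas)
```

A value close to 1 means the two columns rise together almost linearly, which is the pattern the scatter plot of price against `sqft_living` is meant to check visually.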

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(x='price',y='sqft_living',data=df)

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='bedrooms',y='price',data=df)

### Geographical properties

In [None]:
df.corr(numeric_only=True)['bedrooms'].sort_values()

In [None]:
df.corr(numeric_only=True)['zipcode'].sort_values()

There aren't really any correlations between bedrooms and zipcodes here. Let's keep visualizing our data.

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x='price',y='long',data=df)

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x='long',y='lat',data=df,hue='price')

In [None]:
df.sort_values('price',ascending=False).head(20)

In [None]:
len(df)*0.01

In [None]:
non_top_1_perc = df.sort_values('price',ascending=False).iloc[216:]
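
The hard-coded `216` above is roughly 1% of the rows (`len(df)*0.01`). A minimal sketch of the same idea that derives the count from the frame's length is shown below. The frame `toy` and the names `n_top` and `toy_non_top` are hypothetical, and the prices are randomly generated, not real sales.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in prices (not the King County data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'price': rng.lognormal(mean=13, sigma=0.5, size=1000)})

# Number of rows that make up the most expensive 1%
n_top = int(round(len(toy) * 0.01))

# Sort from most to least expensive and skip the first n_top rows
toy_non_top = toy.sort_values('price', ascending=False).iloc[n_top:]

print(n_top, len(toy_non_top))
```

Deriving the count this way keeps the cutoff correct if the number of rows changes.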

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x='long',y='lat',data=non_top_1_perc,
                edgecolor=None,alpha=0.2,palette='RdYlGn',hue='price')

In [None]:
sns.boxplot(x='waterfront',y='price',data=df)

## Working with feature data

#### Feature engineering from Date

In [None]:
df = df.drop('id', axis=1)

In [None]:
df['date'] = pd.to_datetime(df['date'])
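
`pd.to_datetime` infers the format on its own, but an explicit `format` can make the parsing intent clearer. The sketch below assumes the raw strings look like `20141013T000000`. That layout is an assumption about the CSV and should be checked against the actual `date` column before use.

```python
import pandas as pd

# Hypothetical sample strings in the assumed 'YYYYMMDDTHHMMSS' layout
raw_dates = pd.Series(['20141013T000000', '20150225T000000'])

# Parse with an explicit format instead of relying on inference
parsed = pd.to_datetime(raw_dates, format='%Y%m%dT%H%M%S')

# Once parsed, calendar parts are available through the .dt accessor
years = parsed.dt.year
months = parsed.dt.month

print(list(years), list(months))
```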

In [None]:
df['date']

In [None]:
# The vectorized .dt accessor gives the same result as apply with a lambda
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

In [None]:
df.head()

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='month',y='price',data=df)

In [None]:
# Select the price column before aggregating so only one mean is computed
df.groupby('month')['price'].mean().plot()

It looks like there may be some differences between months. The range is about 510k to 560k.

In [None]:
df.groupby('year')['price'].mean().plot()

In [None]:
df = df.drop('date',axis=1)

In [None]:
df.columns

In [None]:
df.head()

In [None]:
# df['zipcode'].value_counts()

In [None]:
df = df.drop('zipcode',axis=1)

In [None]:
df['yr_renovated'].value_counts()

In [None]:
df['sqft_basement'].value_counts()

In [None]:
X = df.drop('price', axis=1).values
y = df['price'].values

## Scaling and Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
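
To make the 70/30 split concrete, here is a minimal NumPy sketch of a shuffled hold-out split. It is not scikit-learn's implementation and will not reproduce the same rows as `train_test_split`. The arrays `X_toy` and `y_toy` are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(101)

# Placeholder data: row i of X_toy is [2i, 2i+1] and y_toy[i] is i
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.arange(100, dtype=float)

test_size = 0.3
n_test = int(round(test_size * len(X_toy)))

# Shuffle the row indices once, then cut them into test and train parts
shuffled = rng.permutation(len(X_toy))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Index X and y with the same indices so features and targets stay aligned
X_train_toy, X_test_toy = X_toy[train_idx], X_toy[test_idx]
y_train_toy, y_test_toy = y_toy[train_idx], y_toy[test_idx]

print(X_train_toy.shape, X_test_toy.shape)
```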


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)
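
The scaler is fit on the training set only and then reused on the test set, so no information from the test rows leaks into the scaling. The sketch below mirrors the min-max formula with plain NumPy on made-up numbers, assuming the default output range of 0 to 1.

```python
import numpy as np

# Made-up feature values: three training rows and one test row
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_test_toy = np.array([[4.0, 150.0]])

# "fit": learn the per-column min and max from the training rows only
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)

# "transform": apply the training statistics to both sets
X_train_scaled = (X_train_toy - col_min) / (col_max - col_min)
X_test_scaled = (X_test_toy - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)
```

Test values can land outside the 0 to 1 range, as the first test feature does here, because the test row lies beyond the training maximum.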

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
X_train.shape

In [None]:
model = Sequential()

model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))
model.add(Dense(19,activation='relu'))

model.add(Dense(1))

model.compile(optimizer='adam',loss='mse')
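
For a sense of the model's size, the sketch below counts the trainable parameters of the stack above by hand. It assumes 19 input features (consistent with the `reshape(-1,19)` used later in this notebook). `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

# Four hidden layers of 19 units followed by a single output unit
layer_sizes = [19, 19, 19, 19, 1]

n_inputs = 19  # assumed number of input features
total = 0
for units in layer_sizes:
    total += dense_params(n_inputs, units)
    n_inputs = units  # this layer's output feeds the next layer

print(total)
```

Once the model has been built by training, `model.summary()` can be used to compare against this hand count.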

In [None]:
model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test),
          batch_size=128, epochs=400)

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses.plot()

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score

In [None]:
predictions= model.predict(X_test)

In [None]:
mean_squared_error(y_test, predictions)

In [None]:
np.sqrt(mean_squared_error(y_test,predictions))

In [None]:
mean_absolute_error(y_test,predictions)
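
The three error metrics above can be sketched by hand with NumPy. The numbers are invented for illustration. `y_pred` is given shape `(n, 1)` to imitate the output of `model.predict`, and is flattened first so the subtraction does not broadcast into an `(n, n)` matrix.

```python
import numpy as np

# Invented true prices and predictions
y_true = np.array([300000.0, 450000.0, 500000.0, 750000.0])
y_pred = np.array([[320000.0], [400000.0], [500000.0], [700000.0]])  # shape (4, 1)

# Flatten the predictions so both arrays have shape (4,)
errors = y_true - y_pred.ravel()

mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in dollars
mae = np.mean(np.abs(errors))   # mean absolute error, in dollars

print(mse, rmse, mae)
```

RMSE and MAE are in the same units as price, which is why they are compared against the typical house price in the next cells.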

In [None]:
df['price'].describe()

In [None]:
5.402966e+05  # mean price, copied from the describe() output above

In [None]:
explained_variance_score(y_test,predictions)
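
The explained variance score is one minus the variance of the errors divided by the variance of the true values. A hand computation on invented numbers looks roughly like this:

```python
import numpy as np

# Invented true prices and predictions
y_true = np.array([300000.0, 450000.0, 500000.0, 750000.0])
y_pred = np.array([320000.0, 400000.0, 500000.0, 700000.0])

residuals = y_true - y_pred

# np.var uses the population variance (ddof=0) by default
evs = 1 - np.var(residuals) / np.var(y_true)

print(evs)
```

A score near 1 means the predictions account for most of the spread in prices. Because the mean of the residuals is subtracted out, a constant bias in the predictions does not lower this score.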

In [None]:
plt.figure(figsize=(12,6))
plt.scatter(y_test,predictions)
plt.plot(y_test,y_test,'r')

In [None]:
single_house = df.drop('price',axis=1).iloc[0]

Our model was trained on scaled features, so we can't pass the raw values directly. The new row has to go through the same fitted scaler first.

In [None]:
single_house = scaler.transform(single_house.values.reshape(-1,19))
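
Both the scaler and the model expect a 2D array of shape `(n_samples, n_features)`, which is why the single row is reshaped before use. A small NumPy sketch of that reshaping, with a placeholder row of 19 values:

```python
import numpy as np

n_features = 19  # assumed to match the number of feature columns

# Placeholder 1D row standing in for one house's features
row = np.arange(n_features, dtype=float)

# Turn the 1D row into a batch containing a single sample
batch = row.reshape(1, -1)

# reshape(-1, n_features) produces the same array for a single row
same_batch = row.reshape(-1, n_features)

print(row.shape, batch.shape)
```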

In [None]:
model.predict(single_house)

The model predicts a sale price of $288,413 for this house, though the estimate may be inflated by the outliers in the training data.