# House Prices Notebook

This notebook is an entry for the House Prices: Advanced Regression Techniques competition. My first attempt scored 0.7, which is not very good. I then looked at some other notebooks and noticed a lot of things that I had done wrong the first time round, so here is my updated attempt. One notebook I found particularly useful was this one by Pedro Marcelino: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python#COMPREHENSIVE-DATA-EXPLORATION-WITH-PYTHON. I also used this introduction to linear models: https://www.kaggle.com/omercansvgn/machine-learning-tutorial-for-beginners.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import numpy as np # linear algebra
import pandas as pd # Data processing
import matplotlib.pyplot as plt # Visualisation
import seaborn as sns # Visualisation
from scipy import stats # Stats
from scipy.stats import norm # Normal distribution, used for the fitted curves
from sklearn.preprocessing import StandardScaler # Preprocessing
import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Inspecting the Data

First, we need to look at the data to see which variables we are going to use in our model.

In [None]:
# Train Data
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
# Test Data
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
train.columns

In [None]:
# Sale Price
# print() is needed here because only the last expression in a cell is displayed
print(train['SalePrice'].describe())
plt.style.use('ggplot')
sns.distplot(train['SalePrice'], color='blue')
plt.xticks(rotation=45)
plt.ylabel('Density')  # distplot normalises the histogram when the KDE curve is drawn
plt.title('Sale Price Histogram')
plt.show()

The sale price is positively skewed (it has a long right tail). This will need to be adjusted for later.
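
To put a number on the skew rather than judging it by eye, pandas provides `Series.skew()` and `Series.kurt()`. The sketch below uses a synthetic right-skewed sample (log-normal draws, which are an assumption for illustration and not the competition data); in this notebook the same calls would be made on `train['SalePrice']`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for SalePrice: log-normal draws are right-skewed
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name='SalePrice')

# Skewness > 0 means a longer right tail; 0 would be symmetric
skewness = prices.skew()
kurtosis = prices.kurt()  # excess kurtosis (0 for a normal distribution)
print(f'Skewness: {skewness:.3f}')
print(f'Kurtosis: {kurtosis:.3f}')
```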

In [None]:
# Heatmap to find the variables that are correlated most with the Sale Price
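# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
corrmat = train.corr(numeric_only=True)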
cols = corrmat.nlargest(10, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1)
hm = sns.heatmap(cm, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

Looking at this heatmap, here is my assessment of the variables:
* OverallQual is the most correlated, so I will use it
* GrLivArea is also highly correlated, so I will use it
* GarageCars and GarageArea are similar variables, so I will only keep GarageCars
* Similarly, TotalBsmtSF and 1stFlrSF are similar variables, so I will only keep TotalBsmtSF
* FullBath is a bit weird to see correlated, but I'll use it
* TotalRmsAbvGrd is very similar to GrLivArea, so I won't use it
* YearBuilt is only slightly correlated, but I'll keep it
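
The "similar variables" calls above can be double-checked by listing feature pairs whose mutual correlation exceeds a threshold. This is a minimal sketch: the helper name `correlated_pairs`, the 0.8 cut-off, and the toy data frame are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, corr) for column pairs with |corr| above threshold."""
    corr = df.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Toy data: GarageArea is built to track GarageCars closely, FullBath is independent
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    'GarageCars': cars,
    'GarageArea': cars * 250 + rng.normal(0, 40, size=200),
    'FullBath': rng.integers(1, 4, size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

In the notebook, calling the same helper on `train[cols]` would list the pairs to consider dropping one member from.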

In [None]:
# Creating the training data
columns = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
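# .copy() gives an independent frame, so later column assignments do not hit a view of train
train_data = train[columns].copy()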
# Scatter Plot of Variables
sns.set()
sns.pairplot(train_data, height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
plt.show()

In [None]:
# Missing data
total = train_data.isnull().sum().sort_values(ascending=False)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

That's handy: there isn't any missing data. I was going to create a separate section for cleaning the data, but we don't need one.

# EDA

Now it's time to look at how our variables are related to the sale price. 

In [None]:
# OverallQual
sns.boxplot(data=train_data, x='OverallQual', y='SalePrice', palette=sns.color_palette(), linewidth=1)
plt.title('OverallQual BoxPlot')
plt.show()

It's not known how the overall quality was measured. The growth in price is also non-linear (maybe polynomial or exponential; it's impossible to say from the plot).

In [None]:
# GrLivArea
plt.scatter(x=train_data['GrLivArea'], y=train_data['SalePrice'], alpha=0.4)
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.title('GrLivArea ScatterPlot')
plt.show()

Looks like there are two outliers in the bottom right corner (very large living area but a low sale price). Let's remove them.
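
Rather than reading Ids off a sorted table, points like these can be selected with a boolean mask over the same region of the plot. The cut-offs below (living area above 4000 and price below 300,000) are assumed values chosen by eye for this sketch, and the rows are made up; adjust both to what the scatter plot actually shows.

```python
import pandas as pd

# Made-up rows: two very large houses with unusually low prices, plus normal ones
houses = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5],
    'GrLivArea': [1500, 2200, 4700, 5600, 4300],
    'SalePrice': [180000, 250000, 185000, 160000, 745000],
})

# Hypothetical thresholds for the "bottom right corner" of the scatter plot
is_outlier = (houses['GrLivArea'] > 4000) & (houses['SalePrice'] < 300000)
dropped_ids = houses.loc[is_outlier, 'Id'].tolist()
print('Dropping Ids:', dropped_ids)

# Keep everything outside the flagged region
cleaned = houses.loc[~is_outlier].reset_index(drop=True)
```

Note that the large but expensive house (Id 5 in the toy data) is kept, since it follows the overall trend.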

In [None]:
# Removing the points
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
train_data = train[columns].copy()

In [None]:
# GarageCars
sns.boxplot(data=train_data, x='GarageCars', y='SalePrice', palette=sns.color_palette(), linewidth=1)
plt.title('GarageCars BoxPlot')
plt.show()

In [None]:
# TotalBsmtSF
plt.scatter(x=train_data['TotalBsmtSF'], y=train_data['SalePrice'], alpha=0.4)
plt.xlabel('TotalBsmtSF')
plt.ylabel('SalePrice')
plt.title('TotalBsmtSF ScatterPlot')
plt.show()

There are two things to note from this scatter plot. First, there is a large cluster of points with no basement (a TotalBsmtSF of 0), which may cause problems. Second, the point with the highest TotalBsmtSF (with a sale price just below 300,000) is an outlier, so I will remove it.
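
The size of the zero-basement cluster is easy to quantify, and it matters for the log transform used later in this notebook: `np.log1p(0)` is exactly 0, so those houses stay stacked at zero instead of joining the bell-shaped bulk. A small sketch on made-up values (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up basement areas; 0 means the house has no basement
bsmt = pd.Series([0, 0, 850, 1200, 0, 960, 1710], name='TotalBsmtSF')

n_zero = int((bsmt == 0).sum())
print(f'{n_zero} of {len(bsmt)} houses have no basement')

# log1p keeps zeros at zero, so the cluster survives the transform
logged = np.log1p(bsmt)
print(logged.round(3).tolist())
```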

In [None]:
# Removing the point
train.sort_values(by = 'TotalBsmtSF', ascending = False)[:1]
train = train.drop(train[train['Id'] == 333].index)
train_data = train[columns].copy()

In [None]:
# FullBath
sns.boxplot(data=train_data, x='FullBath', y='SalePrice', palette=sns.color_palette(), linewidth=1)
plt.title('FullBath BoxPlot')
plt.show()

In [None]:
# YearBuilt
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(data=train_data, x='YearBuilt', y='SalePrice', linewidth=1)
plt.xticks(rotation=90)
plt.title('YearBuilt BoxPlot')
plt.show()

In [None]:
# YearBuilt
f, ax = plt.subplots(figsize=(16, 8))
plt.scatter(x=train_data['YearBuilt'], y=train_data['SalePrice'], alpha=0.4)
plt.xlabel('YearBuilt')
plt.ylabel('SalePrice')
plt.xticks(rotation=90)
plt.title('YearBuilt ScatterPlot')
plt.show()

Looks like this graph has an exponential growth rate. Kind of sucks for people who might be looking to buy a house in the near future, like myself. Anyway, I now want to fix the skewed SalePrice data so that it works better in our model.

In [None]:
# Histogram and normal plot
sns.distplot(train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)

Handily, taking the log of the y value is a neat shortcut for fixing positive skewness, and it often works well for strictly positive values such as prices.
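
As a quick check of that claim, the sketch below compares skewness before and after `np.log1p` on synthetic log-normal data (an assumption standing in for prices). It also confirms that `np.expm1` undoes the transform, which is what is needed to turn log-scale predictions back into prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
raw = rng.lognormal(mean=12.0, sigma=0.4, size=2000)  # synthetic, right-skewed

logged = np.log1p(raw)
skew_before = pd.Series(raw).skew()
skew_after = pd.Series(logged).skew()
print(f'Skewness before: {skew_before:.3f}')
print(f'Skewness after:  {skew_after:.3f}')

# expm1 is the inverse of log1p (up to floating-point error)
restored = np.expm1(logged)
round_trip_ok = bool(np.allclose(restored, raw))
print('Round trip OK:', round_trip_ok)
```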

In [None]:
# Reshaping the SalePrice data
train['SalePrice'] = np.log1p(train['SalePrice'])
train_data['SalePrice'] = train['SalePrice']

# The original graphs but with the reshaped sale prices
sns.distplot(train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)

Much better! Let's look at the other numeric data.

In [None]:
# GrLivArea
kol = 'GrLivArea'
# Reshaping the data
train[kol] = np.log1p(train[kol])
train_data[kol] = train[kol]

# Graphs
sns.distplot(train[kol], fit=norm);
fig = plt.figure()
res = stats.probplot(train[kol], plot=plt)

In [None]:
train['TotalBsmtSF'] = np.log1p(train['TotalBsmtSF'])
train_data['TotalBsmtSF'] = train['TotalBsmtSF']

In [None]:
# Graphs for TotalBsmtSF
sns.distplot(train_data['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(train_data['TotalBsmtSF'], plot=plt)

In [None]:
train_data.head()

# Creating the Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Splitting up that data
data = train_data.drop(['SalePrice'], axis=1)
labels = train_data['SalePrice']
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3,random_state=42)
# Model
model = LinearRegression()
model.fit(x_train, y_train)

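# RMSE on the held-out split; SalePrice is log-transformed, so this is on the log scale
mse = mean_squared_error(y_test, model.predict(x_test))
rmse = np.sqrt(mse)
# Results (model.score returns R^2 for a regressor)
print('Score:', model.score(x_test, y_test))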
print('Model Intercept:',model.intercept_)
print('Model Coef:',model.coef_)
print('RMSE:',rmse)

In [None]:
pred = model.predict(x_test)
sns.distplot(pred, fit=norm);

# Fixing the Test Data

The test data also needs to be transformed so that our model can use it to make predictions.
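
One way to keep the train and test transformations in step is to put them in a single function and call it on both frames. This is a sketch only: `prepare_features` is a hypothetical helper, the sample rows are made up, and the zero-fill for missing values is the same assumption made in the cells that follow.

```python
import numpy as np
import pandas as pd

FEATURES = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
LOG_FEATURES = ['GrLivArea', 'TotalBsmtSF']

def prepare_features(df):
    """Select the model features, zero-fill gaps, and log-transform the area columns."""
    out = df[FEATURES].copy()
    # Assumption: a missing garage or basement value means there is none
    for col in ['GarageCars', 'TotalBsmtSF']:
        out[col] = out[col].fillna(0)
    # Same log1p transform that was applied to the training features
    for col in LOG_FEATURES:
        out[col] = np.log1p(out[col])
    return out

# Tiny made-up frame with one missing garage and basement value
sample = pd.DataFrame({
    'Id': [1461, 1462],
    'OverallQual': [5, 6],
    'GrLivArea': [896, 1329],
    'GarageCars': [1.0, np.nan],
    'TotalBsmtSF': [882.0, np.nan],
    'FullBath': [1, 1],
    'YearBuilt': [1961, 1958],
})
prepared = prepare_features(sample)
print(prepared)
```

Because the column order is fixed by `FEATURES`, the frame passed to `model.predict` lines up with the columns the model was trained on.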

In [None]:
cols = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
# .copy() so the fills and transforms below modify test_data rather than a view of test
test_data = test[cols].copy()
test_data.head()

In [None]:
# Check for null values
total = test_data.isnull().sum().sort_values(ascending=False)
percent = (test_data.isnull().sum()/test_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

In [None]:
# I will assume that these values are 0
test_data['GarageCars'] = test_data['GarageCars'].fillna(0)
test_data['TotalBsmtSF'] = test_data['TotalBsmtSF'].fillna(0)

In [None]:
# Now I need to transform the test data
test_data['GrLivArea'] = np.log1p(test_data['GrLivArea'])
test_data['TotalBsmtSF'] = np.log1p(test_data['TotalBsmtSF'])
test_data.head()

In [None]:
# Make predictions
log_prediction = model.predict(test_data)
sns.distplot(log_prediction, fit=norm);

In [None]:
# Create the Submission df
# np.expm1 inverts the np.log1p transform applied to SalePrice (same as exp(x) - 1)
saleprice = np.expm1(log_prediction)
df_submit = pd.DataFrame({'Id': test['Id'], 'SalePrice': saleprice})
df_submit.to_csv('Submit.csv', index=False)