# House prices prediction demo
1. Introduction
2. Setup system & Load data
3. Data preparation
4. Model training and evaluation

## Introduction
Welcome to an end-to-end example of predictive modeling. In this demo, we will predict house sale prices and practice feature engineering on the [Ames Housing dataset](http://jse.amstat.org/v19n3/decock.pdf).

## Setup system & Load data

### Load data from link

Let's start by downloading the train and test datasets from our prepared Google Drive into the ./input/ folder.

In [0]:
# -p avoids an error if the folder already exists
!mkdir -p ./input
# Quote the URLs so the shell does not treat "&" as a background operator
!wget -O ./input/train.csv "https://drive.google.com/uc?id=1G-2hqAmlKF7nSqGbIT_HFMPqXmH-DJpH&export=download"
!wget -O ./input/test.csv "https://drive.google.com/uc?id=1x9ELPh4gUjPEx4Fv2yHWu4wKSlNYkro8&export=download"
!wget -O ./input/data_description.txt "https://drive.google.com/uc?id=1697k6a06knZ3ZVanbZOKJZYIswVebGH7&export=download"

In [0]:
# check if download is completed
!ls ./input

### Load packages

In [0]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)

### Load data

In [0]:
df_train = pd.read_csv('./input/train.csv')

In [0]:
df_train.head()

In [0]:
df_train.shape

Check dataframe characteristics

In [0]:
df_train.describe()

In [0]:
df_train.columns

## Data preparation
* Drop unrelated columns
* Handle categorical variables
* Handle missing numeric values
* Create matrices for the model

### Preparing train data

In [0]:
# Drop non-related column
df_train_drop_non_relate = df_train.drop(['Id'], axis=1)

In [0]:
# Handle categorical variables: convert categorical variable into dummy/indicator variables
df_train_dummies = pd.get_dummies(df_train_drop_non_relate)

In [0]:
df_train_dummies.head()

In [0]:
df_train_dummies.shape
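
To make the dummy-encoding step concrete, here is a minimal sketch on a made-up frame. The rows are invented for illustration; only the column names `LotArea` and `Street` are borrowed from the Ames data.

```python
import pandas as pd

# Made-up rows for illustration; not taken from train.csv.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Numeric columns pass through unchanged; each category of a text column
# becomes its own indicator column named <column>_<category>.
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
```

The single `Street` column is replaced by `Street_Grvl` and `Street_Pave`, which is why the shape of `df_train_dummies` has many more columns than the original frame.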

In [0]:
# Handle the numeric missing values: filling NA's with the mean of the column
df_train_fillna = df_train_dummies.fillna(df_train_dummies.mean())
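
The mean-filling line works column by column. A small sketch with invented values (the columns `a` and `b` are hypothetical) shows how `fillna` pairs each column with its own mean:

```python
import pandas as pd

# Hypothetical columns with one missing value each.
toy = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [10.0, 20.0, None],
})

# toy.mean() is a Series indexed by column name, so fillna matches
# each column with its own mean: a -> 2.0, b -> 15.0.
toy_filled = toy.fillna(toy.mean())
print(toy_filled)
```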

In [0]:
# Creating matrices for the model
X_train = df_train_fillna.drop(['SalePrice'], axis=1)
y_train = df_train_fillna.SalePrice

### Preparing test data

In [0]:
df_test = pd.read_csv('./input/test.csv')

In [0]:
df_test.head()

In [0]:
# Drop non-related column
df_test = df_test.drop(['Id'], axis=1)

# Handle categorical variables: convert categorical variable into dummy/indicator variables
df_test_dummies = pd.get_dummies(df_test)

# Handle missing numeric values: fill NAs with 0
# (note that the train set was filled with column means instead)
df_test_fillna = df_test_dummies.fillna(0)

df_test_fillna.shape
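
The train set is filled with column means while the test set is filled with 0. One common alternative, sketched here as an assumption rather than as this demo's method, is to compute the means on the training data once and reuse them for the test data. The frames and values below are invented.

```python
import pandas as pd

# Invented values; LotFrontage is used only as a familiar column name.
train = pd.DataFrame({"LotFrontage": [60.0, 80.0, None]})
test = pd.DataFrame({"LotFrontage": [None, 70.0]})

# Compute the statistic on the training data only...
train_means = train.mean()

# ...and reuse it for both sets so they are imputed consistently.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
print(test_filled)
```

This keeps the imputed value for a feature the same in both sets and avoids using any information from the test data.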

In [0]:
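# Dummy columns present in train but absent from test (categories never seen in test)
missing_columns = list(set(df_train_fillna.columns) - set(df_test_fillna.columns))

# Add them to the test set as all-zero indicator columns
for mc in missing_columns:
    df_test_fillna[mc] = 0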

In [0]:
df_test_fillna.shape
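
Adding the missing columns makes the column sets overlap, but it does not guarantee the same column order, and it leaves in place any dummy column that exists only in the test set. A minimal sketch of one way to align both at once uses `DataFrame.reindex`; the frames below are invented for illustration.

```python
import pandas as pd

# Invented one-row frames; the test frame has a different column order,
# lacks Street_Grvl, and has an extra Alley_Pave column.
train_X = pd.DataFrame({"LotArea": [8450], "Street_Grvl": [0], "Street_Pave": [1]})
test_X = pd.DataFrame({"Street_Pave": [1], "LotArea": [9600], "Alley_Pave": [1]})

# Keep exactly the training columns, in training order.
# Columns missing from test are created and filled with 0;
# columns unknown to train are dropped.
aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(aligned)
```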

In [0]:
# Creating matrices for the model
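# Align the test features with the training features: same columns, same order
X_test = df_test_fillna.drop(['SalePrice'], axis=1).reindex(columns=X_train.columns, fill_value=0)
y_test = df_test.SalePrice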

## Model training and evaluation

### Prediction

In this class we focus on data cleansing, so we will use a basic linear regression model.

In [0]:
# Model Training using Linear Regression
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

In [0]:
# Model Evaluation
from sklearn.metrics import mean_squared_error

In [0]:
# train score
y_train_pred = lr.predict(X_train)
train_score = lr.score(X_train, y_train)

print("Train RMSE: ", mean_squared_error(y_train, y_train_pred)**0.5)
print("Train score: ", train_score)

In [0]:
# test score
y_test_pred = lr.predict(X_test)
test_score = lr.score(X_test, y_test)

print("Test RMSE: ", mean_squared_error(y_test, y_test_pred)**0.5)
print("Test score: ", test_score)
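
The two numbers printed in the cells above are the root mean squared error and the coefficient of determination R² (which is what `LinearRegression.score` returns). A small sketch with made-up values computes both by hand with NumPy:

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("RMSE:", rmse)
print("R^2:", r2)
```

RMSE is in the same units as `SalePrice`, while R² is unitless, with 1.0 meaning a perfect fit.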

### The most important coefficients

Which features impact the model the most?

In [0]:
coef = pd.Series(lr.coef_, index = X_train.columns)

In [0]:
imp_coef = pd.concat([coef.sort_values().head(10),
                     coef.sort_values().tail(10)])

In [0]:
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
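
The selection above keeps the ten most negative and the ten most positive coefficients. A toy sketch with hypothetical feature names `a` to `e`, keeping two from each end, shows the mechanics:

```python
import pandas as pd

# Hypothetical coefficients; the names and values are invented.
coef = pd.Series({"a": 3.0, "b": -5.0, "c": 0.5, "d": 10.0, "e": -1.0})

# sort_values() orders ascending, so head() holds the most negative
# coefficients and tail() the most positive ones.
sorted_coef = coef.sort_values()
imp_coef = pd.concat([sorted_coef.head(2), sorted_coef.tail(2)])
print(imp_coef)
```

Raw coefficients are expressed in the units of each feature, so comparing magnitudes across features is only a rough guide to importance.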