# Started 2019. 02. 08
Studying and translating the kernel https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

# Stacked Regressions to predict House Prices

## Serigne

### July 2017
### If you use parts of this notebook in your scripts/notebooks, giving some kind of credit would be very much appreciated :) You can for instance link back to this notebook. Thanks!

This competition is very important to me as it helped me begin my journey on Kaggle a few months ago. I've read some great notebooks here. To name a few:
1. Comprehensive data exploration with Python by Pedro Marcelino: Great and very motivational data analysis

2. A study on Regression applied to the Ames dataset by Julien Cohen-Solal: Thorough feature engineering and a deep dive into linear regression analysis, yet really easy to follow for beginners.

3. Regularized Linear Models by Alexandru Papiu : Great Starter kernel on modelling and Cross-validation 

I can't recommend enough that every beginner go carefully through these kernels (and of course through many other great kernels) and get their first insights into data science and Kaggle competitions.

After that (and some basic practice) you should be more confident to go through this great script by Human Analog, who did impressive work on feature engineering.

As the dataset is particularly handy, I decided a few days ago to get back into this competition and apply the things I have learnt so far, especially stacking models. For that purpose, we build two stacking classes (the simplest approach and a less simple one).

As these classes are written for general purpose, you can easily adapt and/or extend them for your own regression problems. The overall approach is hopefully concise and easy to follow.

The feature engineering is rather parsimonious (at least compared to some other great scripts). It is pretty much:
+ **Imputing missing values** by proceeding sequentially through the data
+ **Transforming** some numerical variables that seem really categorical
+ **Label Encoding** some categorical variables that may contain information in their ordering set
+ **Box Cox Transformation** of skewed features (instead of log-transformation): This gave me a slightly better result both on leaderboard and cross-validation.
+ **Getting dummy variables** for categorical features.
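
The five steps listed above can be sketched on a toy frame. This is only an illustration under stated assumptions: the column names, the values, the median imputation, the explicit quality ordering, and the Box-Cox parameter `lam = 0.15` are all hypothetical choices made for this example and are not taken from this section of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p

# Toy data; every column name and value here is hypothetical.
df = pd.DataFrame({
    'area': [8450.0, 9600.0, np.nan, 215245.0],   # skewed numeric feature with a gap
    'building_class': [60, 20, 60, 70],           # numeric code that is really a category
    'quality': ['Gd', 'TA', 'Ex', 'TA'],          # category with a natural order
    'street': ['Pave', 'Pave', 'Grvl', 'Pave'],   # category without an order
})

# 1. Impute missing values (median is an assumed strategy).
df['area'] = df['area'].fillna(df['area'].median())

# 2. Treat a numeric code as a categorical variable.
df['building_class'] = df['building_class'].astype(str)

# 3. Label-encode an ordered category with an explicit (assumed) ordering.
order = ['Fa', 'TA', 'Gd', 'Ex']
df['quality'] = pd.Categorical(df['quality'], categories=order, ordered=True).codes

# 4. Box-Cox transform of 1 + x for the skewed feature (lam is an assumed value).
lam = 0.15
df['area'] = boxcox1p(df['area'].to_numpy(), lam)

# 5. One-hot encode the remaining categorical columns.
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

The order of the steps matters: the dummy variables are created last, so that columns converted to strings in step 2 are expanded and the label-encoded column from step 3 stays numeric.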

Then we choose many base models (mostly sklearn based models + the sklearn API of DMLC's XGBoost and Microsoft's LightGBM) and cross-validate them on the data before stacking/ensembling them. The key here is to make the (linear) models robust to outliers. This improved the result both on LB and cross-validation.

To my surprise, this does well on LB (0.11420 and top 4% the last time I tested it).

**Hope that at the end of this notebook, stacking will be clear for those, like myself, who found the concept not so easy to grasp.**
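
To make the stacking idea mentioned above concrete before the notebook develops it, here is a minimal sketch of one common variant: each base model produces out-of-fold predictions, and a meta-model is trained on those predictions. It is not the author's stacking classes. The helper name `stack_fit_predict`, the synthetic data, the choice of base models, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def stack_fit_predict(base_models, meta_model, X, y, X_new, n_folds=5):
    """Hypothetical helper: train a meta-model on out-of-fold base predictions."""
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    oof = np.zeros((X.shape[0], len(base_models)))        # out-of-fold predictions
    new_preds = np.zeros((X_new.shape[0], len(base_models)))
    for j, model in enumerate(base_models):
        for train_idx, holdout_idx in kfold.split(X):
            # Fit on the training folds, predict only the held-out fold.
            fitted = clone(model).fit(X[train_idx], y[train_idx])
            oof[holdout_idx, j] = fitted.predict(X[holdout_idx])
        # Refit on all rows to produce this base model's column for the new data.
        new_preds[:, j] = clone(model).fit(X, y).predict(X_new)
    # The meta-model learns how to combine the base models' predictions.
    meta = clone(meta_model).fit(oof, y)
    return meta.predict(new_preds)


# Synthetic regression data (not the house price data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

base_models = [
    make_pipeline(RobustScaler(), Lasso(alpha=0.01)),  # scaling by median/IQR
    Ridge(alpha=1.0),
]
pred = stack_fit_predict(base_models, LinearRegression(), X_train, y_train, X_test)
rmse = np.sqrt(np.mean((pred - y_test) ** 2))
print('stacked RMSE: {:.3f}'.format(rmse))
```

Using out-of-fold predictions means the meta-model never sees a base prediction made on a row that the base model was trained on, which keeps the meta-model from simply rewarding the base model that overfits most.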

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

import warnings
warnings.filterwarnings("ignore")

from scipy import stats
from scipy.stats import norm, skew

pd.set_option('display.float_format', lambda x : '{:.3f}'.format(x))

from subprocess import check_output
print(check_output(['ls', '../input']).decode('utf8'))

In [None]:
# now let's import and put the train and test datasets in pandas dataframe

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [None]:
#display the first five rows of the train dataset.
train.head(5)

In [None]:
#display the first five rows of the test dataset.
test.head(5)

In [None]:
# check the numbers of samples and features
print('The train data size before dropping Id feature is : {}'.format(train.shape))
print('The test data size before dropping Id feature is : {}'.format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#Now drop the 'Id' column since it's unnecessary for the prediction process.
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)

#check again the data size after dropping the 'Id' variable
print('\nThe train data size after dropping Id feature is : {}'.format(train.shape))
print('The test data size after dropping Id feature is : {}'.format(test.shape))


# Data Processing

## Outliers

Documentation for the Ames Housing Data indicates that there are outliers present in the training data.
<br><br>

Let's explore these outliers.

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

We can see at the bottom right two points with extremely large GrLivArea that have a low price. These values are huge outliers. Therefore, we can safely delete them.

In [None]:
#Deleting outliers
train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

#### Note:
Outlier removal is not always safe. We decided to delete these two as they are very large and really bad (extremely large areas for very low prices).
<br>
There are probably other outliers in the training data. However, removing all of them may badly affect our models if there are also outliers in the test data. That's why, instead of removing them all, we will just make some of our models robust to them. You can refer to the modelling part of this notebook for that.
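
As a small illustration of what "robust to outliers" can mean in practice, the sketch below compares standard scaling (mean and standard deviation) with scaling by the median and interquartile range, which is the idea behind sklearn's `RobustScaler`. This section does not say which technique the modelling part uses, so treat the choice of robust scaling here as an assumption. The data is made up.

```python
import numpy as np

# Hypothetical feature with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 500.0])

# Standard scaling: the outlier inflates both the mean and the standard deviation.
standard = (x - x.mean()) / x.std()

# Robust scaling: the median and interquartile range barely react to the outlier.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# Compare how much spread the five ordinary points keep after scaling.
standard_spread = standard[:5].max() - standard[:5].min()
robust_spread = robust[:5].max() - robust[:5].min()
print('spread of ordinary points, standard scaling: {:.3f}'.format(standard_spread))
print('spread of ordinary points, robust scaling:   {:.3f}'.format(robust_spread))
```

With standard scaling the ordinary points are squeezed into a very narrow band, so a linear model has little signal left to work with. With median/IQR scaling they keep a usable spread, while the extreme point simply remains extreme.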

## Target Variable 

<br>
**SalePrice** is the variable we need to predict. So let's do some analysis on this variable first.

In [None]:
sns.distplot(train['SalePrice'], fit=norm)

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)],
          loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

The target variable is right skewed. As (linear) models love normally distributed data, we need to transform this variable and make it more normally distributed.

#### Log-transformation of the target variable

In [None]:
# We use the numpy function log1p which applies log(1+x) to all elements of the column
train['SalePrice'] = np.log1p(train['SalePrice'])

# Check the new distribution
sns.distplot(train['SalePrice'], fit=norm)

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)],
          loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

## 2019. 02. 24

The skew now seems corrected and the data appears more normally distributed.
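
The effect of the log transformation can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normally distributed "prices" (an assumption; it is not the competition data) to compare the sample skewness before and after `np.log1p`, and shows that `np.expm1` maps values back to the original scale, which is needed when predictions are made on the log scale.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed values standing in for sale prices (hypothetical data).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# Same transformation as applied to SalePrice: log(1 + x).
log_prices = np.log1p(prices)

skew_before = skew(prices)
skew_after = skew(log_prices)
print('skew before log1p: {:.2f}'.format(skew_before))
print('skew after log1p:  {:.2f}'.format(skew_after))

# expm1 is the inverse of log1p, so log-scale values can be mapped back.
recovered = np.expm1(log_prices)
print('round trip ok:', np.allclose(recovered, prices))
```

A skewness close to zero after the transformation is consistent with the straighter QQ-plot, although it is not a formal test of normality.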

### Features engineering

let's