## Hello, my Kaggle friends.   
Today I am starting a new competition: predicting house prices.   
This task looks more difficult than computing Titanic passenger survival probabilities.  
Still, let's start by importing the libraries and the data.

In [None]:
#common
import numpy as np
import pandas as pd 
import IPython
from IPython.display import display
import warnings
warnings.simplefilter('ignore')

#visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.style as style
from matplotlib.colors import ListedColormap

from sklearn.metrics import get_scorer_names  # replaces SCORERS, which newer scikit-learn releases no longer export
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder, LabelEncoder, OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils import shuffle, resample
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, IncrementalPCA

#regressors
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR, SVR
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor, Pool, cv

In [None]:
train_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
subm = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')

#### Checking the datasets

In [None]:
train_df.info()

In [None]:
train_df.columns

The very first problem is that we have 80 features for prediction, and we have to pick only the important ones.   
The second question is what to do with the many missing values.  
And finally, we must turn all the 'string' objects into numeric values.  

Ok, let's jump into it.

#### Pre-analysis.  

First, let's check the SalePrice column to understand the distribution of prices.

In [None]:
train_df['SalePrice'].describe()

The mean price is around 180k USD, the most expensive house costs 755k USD and the cheapest only 34.9k USD. The median (50th percentile) lies at 163k USD.  

Draw a distribution plot of prices.

In [None]:
sns.set_style('darkgrid')

fig,ax = plt.subplots(1,1,figsize=(8,6))
sns.histplot(train_df['SalePrice'], kde=True, stat='density', ax=ax)

ax.set_xlabel('House price, USD')
plt.suptitle('Price distribution', size=15)
plt.show()

In [None]:
len(train_df.query('SalePrice > 500000'))

Only nine houses cost more than 500,000 USD, so we may drop them as outliers later.

### Preprocessing

As mentioned above, there are a lot of missing values in the train and test datasets. Using the description text file, we will carefully replace all the NaNs with proper values.

In [None]:
len(train_df), len(test_df)

In [None]:
train_df.isna().sum().sort_values(ascending=False).head(10)

In [None]:
test_df.isna().sum().sort_values(ascending=False).head(10)

Four features (pool quality, miscellaneous feature, type of alley access and fence quality) are missing in more than 80% of the rows. With so few filled values they are unlikely to affect the final sale price much, so we can drop them from both datasets.

In [None]:
train_df = train_df.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence'], axis=1)
test_df = test_df.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence'], axis=1)
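
The 80% rule can also be checked in code instead of reading it off the tables. Here is a minimal sketch on a toy frame; `toy`, `missing_share` and `sparse_cols` are made-up names, and the threshold is just the value mentioned in the text.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Gd'],
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
    'Fence': [np.nan, 'MnPrv', np.nan, np.nan, np.nan, np.nan],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()

# Columns with more than 80% missing values are candidates for dropping.
sparse_cols = missing_share[missing_share > 0.8].index.tolist()
toy_reduced = toy.drop(columns=sparse_cols)

print(sparse_cols)
```

Picking the columns this way keeps the train and test drops in sync if the threshold changes later.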

In [None]:
temp = train_df.isna().sum().sort_values()
temp[temp>=1]

In [None]:
temp = test_df.isna().sum().sort_values()
temp[temp>=1]

Let's combine the two datasets so we can handle the missing values in one place.

In [None]:
full_df = pd.concat([train_df, test_df]).reset_index(drop=True)

In [None]:
full_df

Don't forget to save the Ids of the original datasets, so we can split the combined frame again later.

In [None]:
train_ind = train_df['Id']
test_ind = test_df['Id']
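
To show why the saved Ids matter, here is a small sketch of the round trip: combine two frames, then split them again by Id. The `toy_*` names and the numbers are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-ins for the real train and test frames.
toy_train = pd.DataFrame({'Id': [1, 2, 3],
                          'LotArea': [8450, 9600, 11250],
                          'SalePrice': [208500, 181500, 223500]})
toy_test = pd.DataFrame({'Id': [4, 5],
                         'LotArea': [11622, 14267]})

# The test part has no SalePrice, so concat fills it with NaN.
toy_full = pd.concat([toy_train, toy_test]).reset_index(drop=True)

# The saved Ids let us recover both parts after preprocessing.
toy_train_ids = toy_train['Id']
toy_test_ids = toy_test['Id']

train_part = toy_full[toy_full['Id'].isin(toy_train_ids)]
test_part = toy_full[toy_full['Id'].isin(toy_test_ids)]
```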

In [None]:
test_ind

In [None]:
full_df.head()

In [None]:
temp = full_df.isna().sum().sort_values()
temp[temp>=1]

##### Most common categorical features

Here we have some categorical features (FireplaceQu and GarageQual, for example) and some numeric features (LotFrontage and MasVnrArea). 
First, we deal with the categorical ones.

In [None]:
# For most of these columns the data description uses NA to mean "no such feature"
none_cols = ['FireplaceQu', 'GarageQual', 'GarageFinish', 'GarageCond', 'GarageType',
             'BsmtExposure', 'BsmtQual', 'BsmtCond', 'BsmtFinType2', 'BsmtFinType1',
             'MasVnrType']

full_df[none_cols] = full_df[none_cols].fillna('None')

In [None]:
full_df.isna().sum().sort_values(ascending=False).head(20)

Keep in mind that we don't need to fill the SalePrice column! 

---

##### LotFrontage
Linear feet of street connected to the property. What if this feature depends on LotArea (lot size in square feet)?

In [None]:
temp = full_df[['LotFrontage','LotArea']]

plt.figure(figsize=(10,6))
sns.scatterplot(x=temp['LotFrontage'], y=temp['LotArea'])
plt.title('Correlations between Lot Area and Lot Frontage', size=15);

print(temp.corr())

We will fill the missing LotFrontage values with the square root of LotArea.

In [None]:
full_df['LotFrontage'] = full_df['LotFrontage'].fillna(np.sqrt(full_df['LotArea']))
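
The square root is a rough geometric guess: if a lot were a perfect square with side s, its area would be s² and its frontage would be s, so frontage = sqrt(area). Real lots are rarely square, so treat this as an approximation. A toy example (the numbers are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0, np.nan],
    'LotArea': [8450, 10000, 9600, 6400],
})

# Only the missing rows are replaced; known frontages stay untouched.
toy['LotFrontage'] = toy['LotFrontage'].fillna(np.sqrt(toy['LotArea']))

print(toy)
```

A 10000 sq ft lot gets a frontage of 100 ft and a 6400 sq ft lot gets 80 ft.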

In [None]:
temp = full_df[['LotFrontage','LotArea']]

plt.figure(figsize=(10,6))
sns.scatterplot(x=temp['LotFrontage'], y=temp['LotArea'])
plt.title('Correlations between Lot Area and Lot Frontage with filled missing values', size=15);

print(temp.corr())

We can observe a clear line of new values. Let's see whether it affects the predictions later. 

---

##### Garages and cars

In what year were the garages built?

In [None]:
temp_year = full_df[['GarageYrBlt', 'YearBuilt']]

temp_year

In [None]:
plt.figure(figsize=(10,7))
sns.scatterplot(x=temp_year['YearBuilt'], y=temp_year['GarageYrBlt'])
plt.title('Were houses and garages built at the same time?', size=15);

Nope. We can see that a lot of garages were added to old houses a few years after the construction date.  
After 1980, almost all new houses have a garage by default.  
Look, somebody wants to build a garage after 2200! We must fix that!


In [None]:
full_df.query('GarageYrBlt>2100')['GarageYrBlt']

Ah, what a pity, it's just a typo.

In [None]:
full_df.loc[full_df['GarageYrBlt'] == 2207,'GarageYrBlt'] = 2007

By the way, let's fill all the missing garage years with the year the house was built.

In [None]:
full_df['GarageYrBlt'] = full_df['GarageYrBlt'].fillna(full_df['YearBuilt'])

In [None]:
full_df.isna().sum().sort_values(ascending=False).head(10)

Garage cars and garage area next, please.

In [None]:
full_df['GarageArea'] = full_df.groupby('GarageType')['GarageArea'].transform(lambda x: x.fillna(value=x.median()))
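
The line above fills each missing area with the median of its own garage type. A toy sketch of the same pattern (the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'GarageType': ['Attchd', 'Attchd', 'Attchd', 'Detchd', 'Detchd', 'Detchd'],
    'GarageArea': [500.0, 600.0, np.nan, 300.0, np.nan, 400.0],
})

# transform() returns a Series aligned with the original rows,
# so each group is filled with its own median.
toy['GarageArea'] = (
    toy.groupby('GarageType')['GarageArea']
       .transform(lambda s: s.fillna(s.median()))
)

print(toy)
```

The missing attached garage gets 550 and the missing detached one gets 350, instead of one global median for both.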

I think there should be a strong correlation between Garage Area and number of places for cars.

In [None]:
full_df['GarageCars'].corr(full_df['GarageArea'])

Yes!

In [None]:
full_df.loc[full_df['GarageCars'].isna()]['GarageArea']

This garage has an area of 400 square feet, and we may predict it can accommodate...

In [None]:
full_df.loc[full_df['GarageArea'] == 400]['GarageCars'].value_counts()

...two cars.

In [None]:
full_df['GarageCars'] = full_df['GarageCars'].fillna(2)

##### Veneer area

In [None]:
full_df.loc[full_df['MasVnrArea'].isna()][['MasVnrArea', 'MasVnrType']]

OK, we will replace the missing veneer area with 0.

In [None]:
full_df['MasVnrArea'] = full_df['MasVnrArea'].fillna(0)

##### We need more different zones, Milord

In [None]:
full_df.loc[full_df['MSZoning'].isna()]

In [None]:
full_df['MSZoning'].value_counts()

We just fill the missing zoning values with 'RL', the most frequent one.

In [None]:
full_df['MSZoning'] = full_df['MSZoning'].fillna(value='RL')

##### Utilities

What about the missing available utilities? Let's check the year of construction.

In [None]:
full_df.loc[full_df['Utilities'].isna()]['YearBuilt'] 

What kind of utilities were available in those years?

In [None]:
print(full_df.loc[full_df['YearBuilt'] == 1910]['Utilities'].value_counts())
print(full_df.loc[full_df['YearBuilt'] == 1952]['Utilities'].value_counts())

Comfortable houses, by the way. So, fill the NaNs with 'AllPub'.

In [None]:
full_df['Utilities'] = full_df['Utilities'].fillna(value='AllPub')

##### Time to bath (not bass)

In [None]:
full_df['BsmtHalfBath'].value_counts()

In [None]:
full_df['BsmtFullBath'].value_counts()

In [None]:
full_df.loc[full_df['BsmtHalfBath'].isna() | full_df['BsmtFullBath'].isna(), ['BsmtHalfBath', 'BsmtFullBath', 'YearBuilt']]
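
A side note on finding missing numbers: NaN is a float that is not equal to anything, not even to itself, and certainly not to the string "nan". So the rows have to be selected with `isna()`. A tiny sketch with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'BsmtHalfBath': [0.0, np.nan, 1.0],
                    'BsmtFullBath': [1.0, np.nan, 0.0]})

# NaN never compares equal, even to itself.
nan_equals_itself = (np.nan == np.nan)

# isna() is the reliable way to find the missing rows.
missing_rows = toy[toy['BsmtHalfBath'].isna() | toy['BsmtFullBath'].isna()]

print(nan_equals_itself)
print(missing_rows)
```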

In [None]:
full_df.query('YearBuilt == 1959')['BsmtHalfBath'].value_counts()
#full_df.query('YearBuilt == 1946')['BsmtHalfBath']

Let's assume there are no basement bathrooms in these houses.

In [None]:
full_df[['BsmtHalfBath', 'BsmtFullBath']] = full_df[['BsmtHalfBath', 'BsmtFullBath']].fillna(value=0)

Next, please.   
##### Functional

In [None]:
full_df.Functional.value_counts()

In [None]:
full_df['Functional'] = full_df['Functional'].fillna('Typ')

In [None]:
full_df.isna().sum().sort_values(ascending=False).head(10)

##### Square feet

In [None]:
full_df['BsmtFinSF2'].value_counts()

In [None]:
full_df['BsmtFinSF2'] = full_df['BsmtFinSF2'].fillna(0)

In [None]:
full_df.loc[full_df['BsmtFinSF1'].isna()]['BsmtFinType1']

In [None]:
full_df['BsmtFinSF1'] = full_df['BsmtFinSF1'].fillna(0)

In [None]:
full_df.loc[full_df['TotalBsmtSF'].isna(), 'BsmtFinSF1']

In [None]:
full_df[['TotalBsmtSF', 'BsmtFinSF1']]

In [None]:
full_df['TotalBsmtSF'].corr(full_df['SalePrice'])

In [None]:
full_df.isna().sum().sort_values(ascending=False).head(10)

In [None]:
full_df.loc[full_df['TotalBsmtSF'].isna()]['OverallQual']

In [None]:
full_df.loc[full_df['OverallQual']==4]['BsmtUnfSF'].value_counts()

In [None]:
full_df[['TotalBsmtSF','BsmtUnfSF']] = full_df[['TotalBsmtSF','BsmtUnfSF']].fillna(0)

##### Missing sale type

In [None]:
full_df['SaleType'].value_counts()

In [None]:
full_df['SaleType'] = full_df['SaleType'].fillna('WD')

##### What a beautiful exterior!

In [None]:
full_df.loc[full_df['Exterior2nd'].isna()][['Exterior2nd','Exterior1st','YearBuilt']]

This house was built in 1940. Which type of material was more popular at that time?

In [None]:
full_df.loc[full_df['YearBuilt'] == 1940][['Exterior1st', 'Exterior2nd', 'MSZoning']]

In [None]:
full_df.loc[full_df['YearBuilt'] == 1940]['Exterior1st'].value_counts()

In [None]:
full_df.loc[full_df['YearBuilt'] == 1940]['Exterior2nd'].value_counts()

As we can see, both exterior coverings usually use the same material. Wood and metal were the most common materials.

Let's assume that in this case it is metal siding.

In [None]:
full_df[['Exterior1st','Exterior2nd']] = full_df[['Exterior1st','Exterior2nd']].fillna('MetalSd')

##### The air is electrifying!

In [None]:
full_df.loc[full_df['Electrical'].isna()]['YearBuilt']

This house is almost new.

In [None]:
full_df.loc[full_df['YearBuilt'] == 2006]['Electrical'].value_counts()

There are no other options.

In [None]:
full_df['Electrical'] = full_df['Electrical'].fillna(value='SBrkr')

##### Finally, time for the most important area in the entire house!

In [None]:
full_df.loc[full_df['KitchenQual'].isna()]['YearBuilt']

In [None]:
full_df.loc[full_df['YearBuilt']==1917][['KitchenQual', 'OverallCond']]

In [None]:
full_df.loc[full_df['OverallCond']==3]['KitchenQual'].value_counts()

Ok, we just fill the last missing value with 'TA'.

In [None]:
full_df['KitchenQual'] = full_df['KitchenQual'].fillna(value='TA')

Checking the full dataset.

In [None]:
full_df.isna().sum().sort_values()

Good, only the price values that we must predict are still missing.

### Feature selection  

For the first try, let's choose important features manually.

In [None]:
full_df_ref_man = full_df[[
                           'Street',
                           'Exterior1st',
                           'KitchenQual',
                           'Heating',
    
                           'MSZoning',
                           'YearBuilt',
                           'Neighborhood',
                           'Condition1',
                           'BldgType',
                           'HouseStyle',
                           'OverallQual',
                           'OverallCond',
                           'ExterQual',
                           'ExterCond', 
                           'BsmtQual',
                           'BsmtCond',
                           'CentralAir',
                           'HeatingQC',
                           'Electrical',
                           '1stFlrSF',
                           '2ndFlrSF',
                           'GrLivArea',
                           'FullBath',
                           'BedroomAbvGr',
                           'KitchenAbvGr',
                           'Functional',
                           'GarageType',
                           'GarageQual',
                           'OpenPorchSF',
                           'PoolArea',
                           'SaleType',
                           'SaleCondition',
                           'SalePrice'
                          ]]

In [None]:
full_df_ver2 = full_df[[
                            ### These features were added during the last attempt ###
                           'LotFrontage',
                           'LotArea',
                           'Condition2',
                           'YearRemodAdd',
                           'MasVnrArea',
                           'BsmtFinType1',
                           'TotalBsmtSF',
                           'TotRmsAbvGrd',
                           'Fireplaces',
                           'GarageYrBlt',
                           'GarageCars',
    
                            ### Current best result was performed with these features ### 
                           'Street',
                           'Exterior1st',
                           'KitchenQual',
                           'Heating',
                            
                            ### I also removed some features from the first list ###
                           'MSZoning',
                           'YearBuilt',
                           'Neighborhood',
                           'Condition1',
                           'BldgType',
                           'HouseStyle',
                           'OverallQual',
                           'OverallCond',
                           'ExterQual',
                           'ExterCond', 
                           'BsmtQual',
                           'BsmtCond',
                           'CentralAir',
                           'HeatingQC',
                           'Electrical',
                           '1stFlrSF',
                           '2ndFlrSF',
                           'GrLivArea',
                           #'FullBath',
                           #'BedroomAbvGr',
                           #'KitchenAbvGr',
                           'Functional',
                           'GarageType',
                           #'GarageQual',
                           #'OpenPorchSF',
                           #'PoolArea',
                           'SaleType',
                           'SaleCondition',
                           'SalePrice'
                          ]]

In [None]:
full_df_ver5 = full_df[[
                            ### These features were added during the last attempt ###
                           'LotFrontage',
                           'LotArea',
                           'Condition2',
                           'YearRemodAdd',
                           'MasVnrArea',
                           'BsmtFinType1',
                           'TotalBsmtSF',
                           'TotRmsAbvGrd',
                           'Fireplaces',
                           'GarageYrBlt',
                           'GarageCars',
    
                            ### Current best result was performed with these features ### 
                           'Street',
                           'Exterior1st',
                           'KitchenQual',
                           'Heating',
                            
                            ### I also removed some features from the first list ###
                           'MSZoning',
                           'YearBuilt',
                           'Neighborhood',
                           'Condition1',
                           'BldgType',
                           'HouseStyle',
                           'OverallQual',
                           'OverallCond',
                           'ExterQual',
                           'ExterCond', 
                           'BsmtQual',
                           'BsmtCond',
                           'CentralAir',
                           'HeatingQC',
                           'Electrical',
                           '1stFlrSF',
                           '2ndFlrSF',
                           'GrLivArea',
                           'FullBath',
                           'BedroomAbvGr',
                           'KitchenAbvGr',
                           'Functional',
                           'GarageType',
                           'GarageQual',
                           'OpenPorchSF',
                           'PoolArea',
                           'SaleType',
                           'SaleCondition',
                           'SalePrice'
                          ]]

In [None]:
full_df_ref_man.index = full_df["Id"]
full_df_ver2.index = full_df['Id']
full_df_ver5.index = full_df['Id']

### Feature engineering 

Let's add more features, engineered from the existing ones.

In [None]:
full_df_ver3 = full_df_ver2.copy()

#### Split years of construction into bins.

In [None]:
full_df_ver3['YearBuilt'].corr(full_df_ver3['SalePrice'])

In [None]:
temp = full_df_ver3[['YearBuilt','SalePrice']].groupby('YearBuilt', as_index=False).median()

sns.set_style('whitegrid')
fig, axes = plt.subplots(2,1, sharex=True, figsize=(10,12))

sns.histplot(full_df_ver3['YearBuilt'], ax=axes[0], color='black')
sns.lineplot(x=temp['YearBuilt'], y=temp['SalePrice'], ax=axes[1], color='dimgray')

axes[0].set_xlabel('')
axes[1].set_xlabel('Construction date', size=12)
axes[1].set_ylabel('Median price', size=12)
axes[0].set_ylabel('Number of houses', size=12)

plt.suptitle('Year of construction and Price distributions', size=18, y=(0.91));

We will divide all dates into four bins (up to 1900, 1901-1930, 1931-1980, after 1980).

In [None]:
def yearblt_bin(row):
    
    row = row['YearBuilt']
    
    if row <=1900 :
        return 'very old'
    if 1900 < row <= 1930:
        return 'old'
    if 1930 < row <= 1980:
        return 'moderate'
    else:
        return 'new'
    

full_df_ver3['YearBins'] = full_df_ver3.apply(yearblt_bin, axis=1)
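
The same bins can also be built without a row-wise `apply`. Below is a sketch with `pd.cut`, which should give the same labels as `yearblt_bin`; the sample years are made up, and `year_bins` is just an illustrative name.

```python
import numpy as np
import pandas as pd

years = pd.Series([1880, 1900, 1901, 1930, 1955, 1980, 1981, 2009], name='YearBuilt')

# Right-closed intervals reproduce the <= comparisons:
# (-inf, 1900], (1900, 1930], (1930, 1980], (1980, inf)
year_bins = pd.cut(
    years,
    bins=[-np.inf, 1900, 1930, 1980, np.inf],
    labels=['very old', 'old', 'moderate', 'new'],
)

print(year_bins.tolist())
```

On a few thousand rows the speed difference hardly matters, but the bin edges are easier to read and change in one list.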

In [None]:
full_df_ver3['YearBins']

#### Living Area bins

In [None]:
plt.figure(figsize=(12,4))
sns.histplot(full_df_ver3['GrLivArea'], bins=50, color='black');

Likewise, split the distribution into four bins: 0-800, 801-1700, 1701-2900 and above 2900.

In [None]:
def area_bin(row):
    
    row = row['GrLivArea']
    
    if row <= 800 :
        return 'small'
    if 800 < row <= 1700:
        return 'medium'
    if 1700 < row <= 2900:
        return 'large'
    else:
        return 'extra_large'
    

full_df_ver3['AreaBins'] = full_df_ver3.apply(area_bin, axis=1)

In [None]:
full_df_ver3['AreaBins'].value_counts()

In [None]:
full_df_ver3 = full_df_ver3.drop(['GrLivArea', 'YearBuilt'], axis=1)

## Warning

This approach doesn't work!   
Declined.

---


#### Polynomial features  

It's time to add some polynomial terms of the features that correlate strongly with the target.

In [None]:
full_df_pol = full_df_ver2.copy()
#full_df_pol = full_df_pol.drop(['Condition2','BsmtFinType1','SaleType'], axis=1)

# ** squares the feature; a single * would only rescale it
full_df_pol['OverallQual**2'] = full_df_pol['OverallQual']**2
#full_df_pol['GrLivArea**2'] = full_df_pol['GrLivArea']**2
#full_df_pol['RoomArea'] = full_df_pol['GrLivArea'] / full_df_pol['TotRmsAbvGrd'] 
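
A quick check of what a polynomial term changes. Multiplying a feature by a constant leaves its correlation with the target untouched, while squaring it changes the shape of the relationship. The data below is synthetic (the target is simply defined as x squared), so it is only an illustration:

```python
import numpy as np

x = np.arange(1, 11, dtype=float)   # stand-in for a feature such as OverallQual
y = x ** 2                          # made-up target with a quadratic dependence

corr_x = np.corrcoef(x, y)[0, 1]        # original feature
corr_2x = np.corrcoef(x * 2, y)[0, 1]   # rescaled feature: same correlation
corr_sq = np.corrcoef(x ** 2, y)[0, 1]  # squared feature: perfect fit here

print(corr_x, corr_2x, corr_sq)
```

Tree-based models split on thresholds, so they are not sensitive to such monotonic transforms anyway; the squared term mostly helps linear models.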


#### Features encoding 

Using dummy encoding, we will replace all categorical features with 1 and 0 values.

In [None]:
full_df_upd_0 = pd.get_dummies(full_df_ref_man, drop_first=True)
full_df_enc_2 = pd.get_dummies(full_df_ver2, drop_first=True)
full_df_pol_2 = pd.get_dummies(full_df_pol, drop_first=True)
full_df_upd_3 = pd.get_dummies(full_df_ver3, drop_first=True)
full_df_ver5 = pd.get_dummies(full_df_ver5, drop_first=True)
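
For reference, this is what `drop_first=True` does on a tiny made-up frame: every categorical column becomes one 0/1 column per category, minus the first category, which is implied when all the others are 0.

```python
import pandas as pd

toy = pd.DataFrame({'CentralAir': ['Y', 'N', 'Y'],
                    'KitchenQual': ['TA', 'Gd', 'Ex'],
                    'GrLivArea': [1710, 1262, 1786]})

# Numeric columns pass through unchanged; string columns are expanded.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
```

Encoding the combined frame, as done above, guarantees that train and test end up with the same dummy columns.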

Also, for some gradient boosting machines, let's encode the categorical string values as integers.

In [None]:
enc = OrdinalEncoder()

In [None]:
full_df_ver2.columns

In [None]:
full_df_ver3.columns

In [None]:
cat_features = ['LotFrontage', 'Condition2',
       'BsmtFinType1', 'Fireplaces', 'SaleType', 'SaleCondition', 'Street',
       'Exterior1st', 'KitchenQual', 'Heating', 'MSZoning', 
       'Neighborhood', 'Condition1', 'BldgType', 'HouseStyle', 'OverallQual',
       'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
       'CentralAir', 'HeatingQC', 'Electrical', 'Functional', 'GarageType']

cat_features_3 = ['LotFrontage', 'Condition2',
       'BsmtFinType1', 'Fireplaces', 'SaleType', 'SaleCondition', 'Street',
       'Exterior1st', 'KitchenQual', 'Heating', 'MSZoning', 
       'Neighborhood', 'Condition1', 'BldgType', 'HouseStyle', 'OverallQual',
       'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
       'CentralAir', 'HeatingQC', 'Electrical', 'Functional', 'GarageType', 'YearBins',
       'AreaBins']

In [None]:
full_df_ver2_cat = full_df_ver2.copy()
full_df_ver2_cat[cat_features] = enc.fit_transform(full_df_ver2_cat[cat_features]).astype('int')

full_df_ver3_cat = full_df_ver3.copy()
full_df_ver3_cat[cat_features_3] = enc.fit_transform(full_df_ver3_cat[cat_features_3]).astype('int')
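
One thing to keep in mind: by default `OrdinalEncoder` numbers the categories in alphabetical order, so for quality columns the codes do not follow the real ranking (Po < Fa < TA < Gd < Ex in the data description). If the order matters for a model, it can be passed explicitly. A sketch on a toy column (this is an optional variation, not what the notebook does above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'KitchenQual': ['TA', 'Gd', 'Ex', 'Fa']})

# Default: alphabetical order Ex=0, Fa=1, Gd=2, TA=3.
default_codes = OrdinalEncoder().fit_transform(toy).ravel().astype(int)

# Explicit order from worst to best keeps the quality ranking.
quality_order = [['Po', 'Fa', 'TA', 'Gd', 'Ex']]
ordered_codes = (
    OrdinalEncoder(categories=quality_order)
    .fit_transform(toy).ravel().astype(int)
)

print(default_codes, ordered_codes)
```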

Divide the full dataset into train and test subsets again. Also pick out the target values ('SalePrice').

In [None]:
RND_ST = 42

In [None]:
X_train_2 = full_df_enc_2.query('index in @train_ind').drop(['SalePrice'], axis=1).reset_index(drop=True)
X_test_2 = full_df_enc_2.query('index in @test_ind').drop(['SalePrice'], axis=1).reset_index(drop=True)

X_train_cat = full_df_ver2_cat.query('index in @train_ind').drop(['SalePrice'], axis=1).reset_index(drop=True).astype('int')
X_test_cat = full_df_ver2_cat.query('index in @test_ind').drop(['SalePrice'], axis=1).reset_index(drop=True).astype('int')

X_train_3 = full_df_upd_3.query('index in @train_ind').drop(['SalePrice'], axis=1).reset_index(drop=True).astype('int')
X_test_3 = full_df_upd_3.query('index in @test_ind').drop(['SalePrice'], axis=1).reset_index(drop=True).astype('int')

X_train_3_cat = full_df_ver3_cat.query('index in @train_ind').drop(['SalePrice'], axis=1).reset_index(drop=True).astype('int')
X_test_3_cat = full_df_ver3_cat.query('index in @test_ind').drop(['SalePrice'], axis=1).reset_index(drop=True).astype('int')

y_train = full_df_upd_0.query('index in @train_ind')['SalePrice'].reset_index(drop=True)


### Validation subsets

#X_train_sub, X_test_sub, y_train_sub, y_test_sub = train_test_split(X_train_0, y_train, test_size=0.2, random_state=RND_ST) 

X_train_sub_2, X_valid_sub_2, y_train_sub_2, y_valid_sub_2 = train_test_split(X_train_2, y_train, test_size=0.2, random_state=RND_ST) 


X_train_sub_c, X_valid_sub_c, y_train_sub_c, y_valid_sub_c = train_test_split(X_train_cat, y_train, test_size=0.2, random_state=RND_ST) 
X_train_sub_3, X_valid_sub_3, y_train_sub_3, y_valid_sub_3 = train_test_split(X_train_3, y_train, test_size=0.2, random_state=RND_ST) 

#X_train_sub_3, X_valid_sub_3, y_train_sub_3, y_valid_sub_3 = train_test_split(X_train_3, y_train, test_size=0.2, random_state=RND_ST) 
X_train_sub_3c, X_valid_sub_3c, y_train_sub_3c, y_valid_sub_3c = train_test_split(X_train_3_cat, y_train, test_size=0.2, random_state=RND_ST) 

In [None]:
X_train_5 = full_df_ver5.query('index in @train_ind').drop(['SalePrice'], axis=1).reset_index(drop=True)
X_test_5 = full_df_ver5.query('index in @test_ind').drop(['SalePrice'], axis=1).reset_index(drop=True)

X_train_sub_5, X_valid_sub_5, y_train_sub_5, y_valid_sub_5 = train_test_split(X_train_5, y_train, test_size=0.2, random_state=RND_ST) 

### Model selection  

Looking for the best hyperparameters.

In [None]:
def mae(model, X_train, X_test, y_train, y_test):
    
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    print('MAE train = ', mean_absolute_error(y_train, y_train_pred))
    print('MAE test = ', mean_absolute_error(y_test, y_test_pred))
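
A single train/validation split can be lucky or unlucky. As a complement to the `mae` helper, here is a sketch of the same metric averaged over five folds; the data is synthetic and the model settings are arbitrary, so only the pattern matters.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the house features.
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

model = GradientBoostingRegressor(n_estimators=50, random_state=42)
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MAE is reported with a negative sign.
neg_mae = cross_val_score(model, X_toy, y_toy, cv=folds,
                          scoring='neg_mean_absolute_error')
mae_per_fold = -neg_mae

print(mae_per_fold.mean(), mae_per_fold.std())
```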

In [None]:
### Random Forest Regressor ###

rfr = RandomForestRegressor(n_jobs=-1, random_state=RND_ST)

params_rfr = dict(n_estimators=range(10,500,10),
                  max_features=range(5, 30),
                  max_leaf_nodes = [2,5,10,20])  # scikit-learn requires at least 2 leaf nodes


### Gradient Boosting Regressor ###

gbr = GradientBoostingRegressor(random_state=RND_ST)

params_gbr = dict(n_estimators=range(200,1000,5),
                  max_features=range(5, 40),
                  max_depth=[1,2,3,4],  # depth 0 is not a valid value
                  learning_rate = [0.01, 0.1, 0.5, 1],
                  )

params_gbr_nest = dict(n_estimators=range(200,900,5))

params_gbr_other = dict(max_features=range(10, 40),
                        max_depth=[2,3,4],
                        learning_rate = [0.1, 0.3, 1]
                        #max_features = ['auto', 'sqrt', 'log2']
                       )


### CatBoost ###

catboost_train = Pool(X_train_sub_c, y_train_sub_c, cat_features=cat_features)
catboost_train_full = Pool(X_train_cat, y_train, cat_features=cat_features)

catboost_train_3 = Pool(X_train_sub_3c, y_train_sub_3c, cat_features=cat_features_3)
catboost_train_full_3 = Pool(X_train_3_cat, y_train, cat_features=cat_features_3)
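
The search itself is not shown in this notebook, so here is a minimal sketch of how a grid like `params_rfr` could be passed to `RandomizedSearchCV`. The data is synthetic, the grid is a shrunken copy, and `n_iter=5` and `cv=3` are arbitrary values chosen to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=42)

# A smaller grid in the same shape as params_rfr.
toy_params = dict(n_estimators=list(range(10, 60, 10)),
                  max_features=list(range(2, 8)),
                  max_leaf_nodes=[5, 10, 20])

# Try 5 random combinations, each scored by 3-fold cross-validated MAE.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=toy_params,
    n_iter=5,
    scoring='neg_mean_absolute_error',
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)
best_params = search.best_params_

print(best_params, -search.best_score_)
```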

---

### In this version we will try to implement CatBoost.


#### CatBoost

In [None]:
catboost_1 = CatBoostRegressor(
                          iterations=720, 
                          depth=4, 
                          learning_rate=0.09, 
                          loss_function='MAE', 
                          subsample=0.8,
                          grow_policy='Depthwise',
                          l2_leaf_reg=2,
                          rsm=0.9,
                          verbose=0, 
                          random_seed=RND_ST
    )

In [None]:
catboost_1.fit(X_train_sub_c, y_train_sub_c)

cat_y_tr = catboost_1.predict(X_train_sub_c)
cat_y_val = catboost_1.predict(X_valid_sub_c)

print('Train mae = ', mean_absolute_error(y_train_sub_c, cat_y_tr))
print('Valid mae = ', mean_absolute_error(y_valid_sub_c, cat_y_val))

In [None]:
### CatBoost best
### Train mae =  6562.590378143246
### Valid mae =  16061.543780248663


## Stacking  

Let's try a basic stacking method. 

We will split the training features into two equal subsets and fit the base models on the first one. Then we predict the target ON THE SECOND training subset and join these predictions TO THE second subset as new columns. We repeat the same prediction-and-joining step for the test set.  
It looks like advanced feature engineering.  

Then we will build the meta-regressor, fit it on the updated second training subset and predict the sale price on the updated test dataset. 

For more information about Stacking, please [check this article](https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/).
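
Before the real thing, the whole scheme fits into a short sketch on synthetic data. All names here are made up, there are only two base models, and `GradientBoostingRegressor` stands in for the CatBoost meta-regressor to keep the example light.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_known, X_new, y_known, y_new = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Step 1: split the training data into two equal halves.
X_half_1, X_half_2, y_half_1, y_half_2 = train_test_split(
    X_known, y_known, test_size=0.5, random_state=42)

# Step 2: fit the base models on the first half only.
base_models = [LinearRegression(),
               RandomForestRegressor(n_estimators=50, max_depth=3, random_state=42)]
for base in base_models:
    base.fit(X_half_1, y_half_1)

# Step 3: append the base-model predictions as extra feature columns.
def add_predictions(features):
    preds = [base.predict(features) for base in base_models]
    return np.column_stack([features] + preds)

X_half_2_upd = add_predictions(X_half_2)
X_new_upd = add_predictions(X_new)

# Step 4: the meta-regressor learns from the original features plus predictions.
meta_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
meta_model.fit(X_half_2_upd, y_half_2)
final_pred = meta_model.predict(X_new_upd)
```

The base models never see the second half during fitting, so their predictions there are honest out-of-sample values for the meta-regressor to learn from.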

Split train set into two subsets.

In [None]:
X_train_stack_1, X_train_stack_2, y_train_stack_1, y_train_stack_2 = train_test_split(
                                                                        X_train_cat, y_train, test_size=0.5, random_state=RND_ST)

Set a list of basic regressors.

In [None]:
lr = LinearRegression(n_jobs=-1) 

rfr_1 = RandomForestRegressor(n_estimators=100, max_depth=3, min_samples_split=3, n_jobs=-1, random_state=RND_ST)

rfr_2 = RandomForestRegressor(n_estimators=200, max_depth=4, min_samples_split=4, n_jobs=-1, random_state=RND_ST)

rfr_3 = RandomForestRegressor(n_estimators=300, max_depth=5, min_samples_split=5, n_jobs=-1, random_state=RND_ST)

gbr_1 = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.1, subsample=0.9, random_state=RND_ST)

gbr_2 = GradientBoostingRegressor(n_estimators=400, max_depth=4, learning_rate=0.09, subsample=0.8, random_state=RND_ST)

In [None]:
models = [lr, rfr_1, rfr_2, rfr_3, gbr_1, gbr_2]
names = ['lr', 'rfr_1', 'rfr_2', 'rfr_3', 'gbr_1', 'gbr_2']

Fit the basic models.

In [None]:
for model in models:
    model.fit(X_train_stack_1, y_train_stack_1)

In [None]:
X_train_stack_2_upd = X_train_stack_2.copy()

In [None]:
def pred_stack(model, feat, df_upd, name):
    
    pred = pd.Series(model.predict(feat).astype('int'), name=name, index=feat.index)
    
    df_upd = df_upd.join(pred)
    
    return df_upd

Stacking on training set.

In [None]:
for model, name in zip(models, names):
    
    X_train_stack_2_upd = pred_stack(model, X_train_stack_2, X_train_stack_2_upd, name)

In [None]:
X_train_stack_2_upd

Stacking on test subset.

In [None]:
X_test_upd = X_test_cat.copy()

for model, name in zip(models, names):
    
    X_test_upd = pred_stack(model, X_test_cat, X_test_upd, name)

In [None]:
X_test_upd

Create a meta-regressor. We will implement a Catboost regressor.

In [None]:
catboost_stack = CatBoostRegressor(iterations=700, 
                          depth=4, 
                          learning_rate=0.09, 
                          loss_function='MAE', 
                          subsample=0.8,
                          grow_policy='Depthwise',
                          l2_leaf_reg=2,
                          rsm=0.9,
                          verbose=0, 
                          random_seed=RND_ST)

In [None]:
X_train_stack_2_upd

In [None]:
catboost_stack.fit(X_train_stack_2_upd, y_train_stack_2)

In [None]:
pred = catboost_stack.predict(X_train_stack_2_upd)

mean_absolute_error(y_train_stack_2, pred)

Make a Catboost cross-validation.

In [None]:
pool = Pool(X_train_stack_2_upd, y_train_stack_2)

In [None]:
params = dict(iterations=500, 
                          depth=7, 
                          learning_rate=0.09, 
                          loss_function='MAE', 
                          subsample=0.8,
                          grow_policy='Depthwise',
                          l2_leaf_reg=2,
                          rsm=0.9,
                          verbose=0, 
                          #early_stopping_rounds=20,
                          random_seed=RND_ST)

scores = cv(pool,
            params,
            fold_count=2, 
            plot=True)

In [None]:
catboost_stack = CatBoostRegressor(iterations=700, 
                          depth=4, 
                          learning_rate=0.09, 
                          loss_function='MAE', 
                          subsample=0.8,
                          grow_policy='Depthwise',
                          l2_leaf_reg=2,
                          rsm=0.9,
                          verbose=0, 
                          random_seed=RND_ST)

## Feature importances

Rank the features by their CatBoost importance.

In [None]:
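# catboost_stack was re-created in the previous cell, so fit it before reading the importances
catboost_stack.fit(X_train_stack_2_upd, y_train_stack_2)
imp = catboost_stack.feature_importances_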
names = X_train_stack_2_upd.columns.tolist()

important = pd.DataFrame(columns=['imp', 'names'])

important['imp'] = imp
important['names'] = names

important = important.sort_values(by='imp', ascending=False).reset_index(drop=True)

important

Remove the features that each give us less than 0.3% impact, which leaves the top 25.

In [None]:
upd_columns = important['names'][:25]

In [None]:
X_train_stack_2_upd_cols = X_train_stack_2_upd[upd_columns]
X_test_upd_cols = X_test_upd[upd_columns]
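
Taking the first 25 rows and filtering by a 0.3% threshold are two ways to express the same cut. A sketch with invented importances (in percent) shows both side by side:

```python
import pandas as pd

# Made-up importances in percent, only for illustration.
toy_important = pd.DataFrame({
    'names': ['gbr_1', 'GrLivArea', 'OverallQual', 'Street', 'Heating'],
    'imp': [55.0, 30.0, 14.6, 0.25, 0.15],
})

# Option 1: keep a fixed number of top features.
top_k = toy_important.sort_values('imp', ascending=False)['names'][:3].tolist()

# Option 2: keep everything above an importance threshold.
above_threshold = toy_important.loc[toy_important['imp'] >= 0.3, 'names'].tolist()

print(top_k, above_threshold)
```

The threshold version does not need updating when the number of features changes.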

In [None]:
catboost_stack.fit(X_train_stack_2_upd_cols, y_train_stack_2)

In [None]:
pred = catboost_stack.predict(X_train_stack_2_upd_cols)

mean_absolute_error(y_train_stack_2, pred)

### Make a prediction, create the submission file.

#### Prediction for sklearn models

In [None]:
def prediction(model, feat_tr, feat_test, targ_tr):
    
    model.fit(feat_tr, targ_tr)
    pred_final = pd.DataFrame((model.predict(feat_test)), columns=['SalePrice'])
    
    return pred_final

In [None]:
pred = np.around(prediction(catboost_stack, X_train_stack_2_upd_cols, X_test_upd_cols, y_train_stack_2))

submission = pd.DataFrame(subm['Id'])

submission = submission.join(pred)

submission.to_csv('/kaggle/working/cb_new_08.csv', index=False)

In [None]:
submission.head()

#### Prediction for boosting models

mod = catboost_1.fit(catboost_train_full_pol)

pred_fin = pd.DataFrame(np.around(mod.predict(X_test_pol_cat)), columns=['SalePrice'])


submission = pd.DataFrame(subm['Id'])

submission = submission.join(pred_fin)

submission.to_csv('/kaggle/working/catboost_l2.csv', index=False)

In [None]:
submission.head()

## Scoreboard 

0.12625 - Rank 1270 - catboost_stack, X_features_stack_2_upd,  
0.12845 - Rank 1638 - catboost_l2, X_train_cat  
0.12868 - Rank 1645 - catboost, X_train_cat  
0.12886 - Rank 1654 - catboost, X_train_cat  
0.12910 - Rank 1683 - gbr_new_2, X_train_2  
0.13866 - Rank 2403 - gbr_new, X_train_0 (more_features)  
0.13934 - Rank 2433 - catboost, X_train_c  
0.14346 - Rank 2755 - model_gbr_ with updated params, X_train_0 + Year feature.  
0.14631 - Rank 2922 - model_gbr with updated params, X_train_0  
0.15217 - Rank 3330 - model_gbr, X_train_0  
0.20628 - Rank 4340 - very first try, with no feature engineering and just Random Forest Regressor
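
A note on the numbers above: as far as I know, this competition scores submissions by the RMSE between the logarithm of the predicted price and the logarithm of the real price, which is why the scores are so small compared with the MAE values in USD. A sketch of that metric (`rmse_log` is my own helper name, and the prices are made up):

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(predicted) and log(actual) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# A perfect prediction scores 0.
perfect = rmse_log([200000, 150000, 300000], [200000, 150000, 300000])

# Being 10% too high costs log(1.1), whatever the price level.
off_by_10pct = rmse_log([100000], [110000])

print(perfect, off_by_10pct)
```

With this metric an error on a cheap house counts as much as the same relative error on an expensive one, unlike the MAE used for validation in this notebook.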

#### Best models  

gbr_new = GradientBoostingRegressor(n_estimators=265, max_depth=4, max_features=28, random_state=RND_ST)    

gbr_new_2 = GradientBoostingRegressor(n_estimators=385, max_depth=3, max_features=24, random_state=RND_ST) 

catboost_1 = CatBoostRegressor(
                          iterations=700, 
                          depth=4, 
                          learning_rate=0.09, 
                          loss_function='MAE', 
                          subsample=0.8,
                          grow_policy='Depthwise',
                          l2_leaf_reg=2,
                          rsm=0.9,
                          verbose=0, 
                          random_seed=RND_ST
    )
