


<h1 style="font-size:200%;color:#2f4c28"> House Prices - Advanced Regression Techniques </h1>

When I started on this solution I was drowning in the 79 variables that describe (almost) every aspect of residential homes. Which ones should I start with? Or is it better to just take all of them and let it go as it goes?

So for this task I developed the following approach:
1. **General EDA** with general data preparation - in this step I just tried to get some general information about the data and, in some cases, drop some of it. After that I got the idea of separating all the attributes into several groups by their meaning and nature.

2. In the second part of the EDA (actually EDA + feature engineering) I decided to analyze the **values in every group separately**. During this part I analyzed the data, filled NaN values and created new in-group features.

3. After the in-group operations I decided to create some **new features based on all 79 attributes and on some intergroup relations**.

4. The fourth part is the **modeling** - I just create several variants based on different models and put everything together.

I hope this structure will be useful for others who try to solve this task. I also spent some time creating good, straightforward visualisations so that you are comfortable reading this.

**Let's start**


In [None]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, QuantileTransformer
from sklearn.cluster import DBSCAN, KMeans
from sklearn.manifold import TSNE
from sklearn.linear_model import LinearRegression, ElasticNetCV, LassoCV, RidgeCV
from sklearn.ensemble import IsolationForest,GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, GridSearchCV

from IPython.display import display, HTML
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
import missingno
import markdown

from scipy import stats
import pandas as pd
import numpy as np

from mlxtend.regressor import StackingCVRegressor,StackingRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

html = markdown.markdown('<img style="max-height: 700px; \
                           align=center" \
                           src="https://sun9-53.userapi.com/impg/ETdrzYnU_3MnOd3fkaLI8oWCBHt9XNDCgo1dUQ/Ua2rbr3Wt1k.jpg?size=1000x400&quality=96&sign=38aa16cdd20dea8825a2c2b640ea6108&type=album">')
display(HTML(html))

palette = ['#2f4c28',"#c3762c","#b14404",'#3a6049',"#8f6e60","#a03031"]
gradient = ['#2f4c28','#3a6049','#4e7f76',"#8f6e60",'#a96d46',"#c3762c","#b14404","#a03031"]
sns.palplot(gradient)

<a style="font-size:200%;color:#2f4c28">Table of Contents
* [<a style="font-size:150%;color:#2f4c28">0. EDA / Data Preparation (Primary)](#0_bullet)
* [<a style="font-size:150%;color:#2f4c28">1. EDA / Feature engineering (Attributes)](#1_bullet)
    * [<a style="font-size:130%;color:#b14404"> 1.1 Qualitative features](#1.1_bullet)
      -Describes the lot from the qualitative point of view (ordinal categorical data).
    * [<a style="font-size:130%;color:#b14404"> 1.2 Environment features](#1.2_bullet)
      -Describes the location, area and the environment condition of the lot.
    * [<a style="font-size:130%;color:#b14404"> 1.3 Selling terms features](#1.3_bullet)
      -Describes the particular qualities of the selling process.
    * [<a style="font-size:130%;color:#b14404"> 1.4 Land plot features](#1.4_bullet)
      -Describes the land quality features.
    * [<a style="font-size:130%;color:#b14404"> 1.5 Dwelling features](#1.5_bullet)
      -Describes the house in objective values   
    * [<a style="font-size:130%;color:#b14404"> 1.6 Benefits features](#1.6_bullet)
      -Describes lot benefits or utilities
    * [<a style="font-size:130%;color:#b14404"> 1.7 Square feet features](#1.7_bullet)
      -Describes the square footage of the lot areas
    * [<a style="font-size:130%;color:#b14404"> 1.8 Basement features](#1.8_bullet)
      -Describes the basement quality and other features
    * [<a style="font-size:130%;color:#b14404"> 1.9 Garage features](#1.9_bullet)
      -Describes the garage features
    * [<a style="font-size:130%;color:#b14404"> 1.10 Other features](#1.10_bullet)
      -Describes the misc features
    * [<a style="font-size:130%;color:#b14404"> 1.11 Clusterization](#1.11_bullet)
      -General classes of the data  
* [<a style="font-size:150%;color:#2f4c28">2. Modeling](#2_bullet)
    * [<a style="font-size:130%;color:#b14404"> 2.1 Data Preprocessing](#2.1_bullet)
    * [<a style="font-size:130%;color:#b14404"> 2.2 Linear models](#2.2_bullet)
    * [<a style="font-size:130%;color:#b14404"> 2.3 Advanced models](#2.3_bullet)
    * [<a style="font-size:130%;color:#b14404"> 2.4 Mixture models](#2.4_bullet)
    * [<a style="font-size:130%;color:#b14404"> 2.5 Averaging](#2.5_bullet)


# <a class="anchor" id="0_bullet" style="color:#2f4c28"> 0. EDA / Data Preparation (Primary) </a>
----
----

In [None]:
path = "../input/house-prices-advanced-regression-techniques"
df = pd.concat([pd.read_csv(f"{path}/train.csv",index_col=0),
                pd.read_csv(f"{path}/test.csv",index_col=0)])
df = df.reset_index(drop=True)
nan_mask = df["SalePrice"].isna()
print("\033[4mDATA SHAPE CHECK:\033[0m")
print(f" Train dataset length: \t{len(df[~nan_mask])}")
print(f" Test dataset length: \t{len(df[nan_mask])}")
print(f" Number of features: \t{len(df.columns)-1}")

In [None]:
print("\033[4mFEATURE DESCRIPTION EXTRACTION:\033[0m")
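# Read the description file and close it again right away
with open(f"{path}/data_description.txt") as fh:
    lines = fh.read().split('\n')
descriptions = {}
for s in lines:
    s = s.strip()
    if (":" in s) and not ("2nd level" in s):
        # Split on the first colon only, so a description may itself contain colons
        k, v = s.split(":", 1)
        print(f"  {k} \t:{v}")
        descriptions[k] = v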

In [None]:
print("\033[4mATTRIBUTES TYPES:\033[0m")
categorical, numerical = [],[]
for c in df.columns:
    t = df.dtypes[c]
    if t=='object':
        categorical.append(c)
    else:
        numerical.append(c)
print("\n\033[4mCATEGORICAL:\033[0m")
print(categorical)
print("\n\033[4mNUMERICAL:\033[0m")
print(numerical)

In [None]:
print("\033[4mNAN VALUES:\033[0m")
_df = df.loc[:,df.isna().any().values]
_df = _df.drop("SalePrice",axis=1)
missingno.bar(_df, color=gradient, figsize=(30,2), sort="ascending")
display(df.head(3))

In [None]:
high_nan_attrs = ["Alley", "PoolQC", "Fence", "MiscFeature", "FireplaceQu"]
print("Attributes with roughly 50% or more NaN values:\n")
for k in high_nan_attrs:
    print(f"{k}   \t:{descriptions[k]}")

In [None]:
# We keep only MiscFeature, because it can affect SalePrice
df = df.drop(["Alley","PoolQC","Fence","FireplaceQu"], axis=1)

We can see that the original target contains a bunch of outliers. \
They will complicate our visualisation and analysis. \
So we will map the target to a normal distribution with the QuantileTransformer.

In [None]:
target_scaler = QuantileTransformer(output_distribution='normal', random_state=0)
df.loc[~nan_mask,"SalePrice_transformed"] = \
    target_scaler.fit_transform(df.loc[~nan_mask,"SalePrice"].to_numpy().reshape(-1,1))

fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Violin(y = df.loc[~nan_mask,"SalePrice"], 
                        line={"color":palette[0]}, name="original target"), row=1, col=1)
fig.add_trace(go.Violin(y = df.loc[~nan_mask,"SalePrice_transformed"], 
                        line={"color":palette[1]}, name="scaled target"), row=1, col=2)
fig.update_traces(meanline_visible=True)
fig.show()
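To make the round trip explicit, here is a minimal sketch of the same transform on synthetic data. The log-normal "prices" and the `n_quantiles=500` setting are assumptions for illustration only, not values from this notebook.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Hypothetical right-skewed "prices" (log-normal), standing in for SalePrice
prices = rng.lognormal(mean=12.0, sigma=0.4, size=500).reshape(-1, 1)

scaler = QuantileTransformer(n_quantiles=500, output_distribution="normal", random_state=0)
z = scaler.fit_transform(prices)          # roughly standard normal values
restored = scaler.inverse_transform(z)    # back on the original price scale

print(f"mean of transformed target: {z.mean():.3f}")
print(f"max relative round-trip error: {np.max(np.abs(restored - prices) / prices):.2e}")
```

The same `inverse_transform` call is what later turns model predictions back into prices, so the fitted scaler has to be kept around until the submission step.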

In [None]:
attributes_grouped = {
    "qualitative" : ["OverallQual","OverallCond","ExterQual","ExterCond","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","HeatingQC",
                     "LowQualFinSF","KitchenQual","GarageQual","GarageCond"],
    "environment" : ["MSZoning","LotFrontage","Street","Neighborhood","Condition1","Condition2","PavedDrive"],
    "sellterms"   : ["MoSold","YrSold","SaleType","SaleCondition"],
    "landplot"    : ["LotArea","LotShape","LandContour","LotConfig","LandSlope"],
    "dwelling"    : ["BldgType","HouseStyle","YearBuilt","YearRemodAdd","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","MasVnrArea",
                     "Foundation","KitchenQual","Fireplaces","BsmtFullBath","BsmtHalfBath","FullBath","HalfBath","MSSubClass"],
    "benefits"    : ["Heating","CentralAir","Electrical","Utilities","Functional","TotRmsAbvGrd"],
    "sqrfeets"    : ["1stFlrSF","2ndFlrSF","LowQualFinSF","GrLivArea","WoodDeckSF","OpenPorchSF","EnclosedPorch","3SsnPorch","ScreenPorch","PoolArea"],
    
    # Repeats some of the qualitative attributes
    "basement"    : ["BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinSF1","BsmtFinType2","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF"],
    "garage"      : ["GarageType","GarageYrBlt","GarageFinish","GarageCars","GarageArea","GarageQual","GarageCond"],
    "other"       : ["MiscVal","MiscFeature"]}

print("\033[4mATTRIBUTE GROUPS:\033[0m")
for k in attributes_grouped:
    print(f"\n\033[1m{k}\033[0m:")
    print(",".join(attributes_grouped[k]))

At the end of this part we will define some useful helper functions for working with attribute groups:
* **save_to_df** - updates the general DataFrame with processed group attributes
* **replace_dummies** - replaces given feature with dummies values
* **attribute_slice** - gets the subframe of group attributes

In [None]:
def save_to_df(group_name:str, new_df:pd.DataFrame):
    """
    Saves processed dataset of some group of features into main df
    :param group_name: name of the feature group
    :param new_df:     processed DataFrame features
    """
    if "target" in new_df:
        new_df = new_df.drop("target", axis=1)
    global df
    df = df.drop(attributes_grouped[group_name], axis=1)
    df = pd.concat([df,new_df], axis=1)
    attributes_grouped[group_name] = new_df.columns.values
    return attributes_grouped[group_name]
        
def replace_dummies(name , df):
    """
    Replace feature with dummies
    :param name: str feature name
    :param df:   pd.DataFrame data
    """
    _d = pd.get_dummies(df[name])
    _d = _d.rename({i:f"{name}_{i}" for i in _d.columns}, axis=1)
    df = df.drop(name, axis=1)
    return pd.concat([df,_d],axis=1)

def attribute_slice(attributes_group):
    attr_list = attributes_grouped[attributes_group]
    _df = df.copy()[attr_list]
    _df["target"] = df["SalePrice"]
    missingno.bar(_df, color=palette, figsize=(30,2))
    display(_df.head(3))
    return attr_list, _df
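As a quick illustration of what `replace_dummies` does, here is a small self-contained sketch on a toy frame. It uses a compact variant of the helper (relying on the `prefix` argument of `pd.get_dummies`), and the toy values are made up.

```python
import pandas as pd

def replace_dummies(name, df):
    """Replace column `name` with one indicator column per category."""
    dummies = pd.get_dummies(df[name], prefix=name)
    return pd.concat([df.drop(name, axis=1), dummies], axis=1)

# Hypothetical toy data, not taken from the competition files
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                    "LotArea": [8450, 9600, 11250]})
encoded = replace_dummies("Street", toy)
print(encoded.columns.tolist())   # ['LotArea', 'Street_Grvl', 'Street_Pave']
```

The original column disappears and every category gets its own `<name>_<value>` column, which is the naming the rest of the notebook relies on.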

# <a class="anchor" id="1_bullet" style="color:#2f4c28"> 1. EDA / Feature engineering (Attributes) </a>
----
----

# <a class="anchor" id="1.1_bullet" style="color:#b14404"> 1.1 Qualitative features </a>
---

In [None]:
attr_list,_df = attribute_slice("qualitative")

We will replace the categorical values manually, so as not to lose their ordering.

In [None]:
categorical = {"Gd":3,"Av":2,"Mn":1,"No":0}
_df.loc[:,"BsmtExposure"] = _df["BsmtExposure"].replace(categorical)

categorical = {"Ex":5,"Gd":4,"TA":3,"Fa":2,"Po":1,"No":0}
_df = _df.replace(categorical)

categorical = {"GLQ":6,"ALQ":5,"BLQ":4,"Rec":3,"LwQ":2,"Unf":1}
_df = _df.replace(categorical)

_df = _df.fillna(0).astype(int)
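The idea of the manual mapping can be seen on a single toy column. This is only a sketch: the sample values are invented, and it uses `Series.map` on one column instead of the frame-wide `replace` above.

```python
import pandas as pd

# Hypothetical toy column on the Ex/Gd/TA/Fa/Po scale, with one missing value
quality = pd.Series(["Gd", "TA", None, "Ex", "Fa"], name="KitchenQual")

scale = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
# Missing values become 0, i.e. "the thing being rated is not there"
encoded = quality.map(scale).fillna(0).astype(int)

print(encoded.tolist())   # [4, 3, 0, 5, 2]
```

Unlike one-hot encoding, the integers keep the information that "Ex" is better than "Gd", which is better than "TA", and so on.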

## <a class="anchor" style="color:#8f6e60"> General Quality weak regressor</a>
---
After a number of pruning rounds I've chosen the set of features for the general quality feature.

In [None]:
_model = LinearRegression()
_model.fit(_df[~nan_mask][attr_list],_df[~nan_mask]["target"])
_df["_GeneralQuality"] = QuantileTransformer().fit_transform(_model.predict(_df[attr_list]).reshape(-1,1))

In [None]:
pltdf = _df[~nan_mask].copy()
fig = px.scatter(pltdf, x="_GeneralQuality", y="target",color="target",height=400, color_continuous_scale=gradient)
fig.show()
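The "weak regressor" trick can be sketched on synthetic data as follows. The column names `qual_a`, `qual_b` and `_summary`, the coefficients and the 200/100 split are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
n = 300
# Hypothetical group of two ordinal attributes
toy = pd.DataFrame({"qual_a": rng.integers(1, 11, n), "qual_b": rng.integers(0, 6, n)})
toy["target"] = 20_000 * toy["qual_a"] + 5_000 * toy["qual_b"] + rng.normal(0, 10_000, n)
labeled = np.arange(n) < 200           # first 200 rows play the role of the train set
toy.loc[~labeled, "target"] = np.nan   # the rest has no target, like the test set

cols = ["qual_a", "qual_b"]
# Fit on labeled rows only, then predict for every row
model = LinearRegression().fit(toy.loc[labeled, cols], toy.loc[labeled, "target"])
pred = model.predict(toy[cols]).reshape(-1, 1)
# Squash the prediction into [0, 1] so it behaves like any other scaled feature
toy["_summary"] = QuantileTransformer(n_quantiles=100).fit_transform(pred).ravel()

print(toy[["qual_a", "qual_b", "_summary"]].head(3))
```

Note that a feature built this way has seen the target of the training rows, so its fit on the training data can look better than it will on unseen data.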

In [None]:
save_to_df("qualitative", _df)
_df.head(1)

# <a class="anchor" id="1.2_bullet" style="color:#b14404"> 1.2 Environment features </a>
---------------------------

In [None]:
attr_list,_df = attribute_slice("environment")

In [None]:
_df = replace_dummies("MSZoning",_df)

In [None]:
_df.loc[_df["LotFrontage"].isna(),"LotFrontage"] = _df.mode()["LotFrontage"].values[0]
_df["LotFrontage"] = QuantileTransformer().fit_transform(_df["LotFrontage"].to_numpy().reshape(-1,1))
px.scatter(_df, x="LotFrontage", y="target", color="target", height=400, color_continuous_scale=gradient)

In [None]:
_df["_LfOverNormal"] = 0
_df.loc[_df["LotFrontage"]>0.8,"_LfOverNormal"] = 1
px.box(_df, color="_LfOverNormal", x="target", height=400, color_discrete_sequence=palette)

In [None]:
# The "Street" feature contains only 2 values, so we will turn it into the binary "_IsPaveStreet" feature
print(set(_df["Street"]))
_df["_IsPaveStreet"] = (_df["Street"]=='Pave').astype(int)
_df = _df.drop("Street", axis=1)

In [None]:
# For "PavedDrive" we will encode the labels manually. Partially paved will be 0.5
_df.loc[:,"PavedDrive"] = _df["PavedDrive"].replace({"Y":1,"N":0,"P":0.5}) 

## <a class="anchor" style="color:#8f6e60"> Neighborhood features</a> 
---

In [None]:
# This one may be inaccurate, because some of the location names are ambiguous
ll = {"Blmngtn":(40.480592,-89.033689),"Blueste":(47.5248776,-118.1266345),"BrDale":(36.5723285,-82.1790214),"BrkSide":(39.66706, -75.72688),
      "ClearCr":(39.645833,-111.151667),"CollgCr":(37.225412, -76.693987),"Crawfor":(42.683024,-103.405479),"Edwards":(39.64499, -106.5942),
      "Gilbert":(33.35283, -111.78903),"IDOTRR":(41.6613, -91.5299),"MeadowV":(40.0172943, -81.6192906),"Mitchel": (43.70943, -98.0298),"NAmes":(42.034722, -93.62),
      "NoRidge":(34.22834, -118.53675),"NPkVill":(32.580697,-92.0804111),"NridgHt":(48.218016,-114.3329096),"NWAmes":(42.034722, -93.62),
      "OldTown":(29.601657, -82.981928),"SWISU":(42.023949, -93.647595),"Sawyer":(45.907319, -91.320396),"SawyerW": (46.333407, -87.365986),
      "Somerst":(40.4976, -74.48849),"StoneBr":(35.3465056, -82.4917868),"Timber":(45.6760608,-92.1060166),"Veenker": (42.0414857,-93.6501622),}

In [None]:
for k in ll:
    _msk = _df["Neighborhood"]==k
    _df.loc[_msk,["_Lat","_Lon"]] = ll[k]
    
    _target = _df.loc[_msk,"target"]
    _df.loc[_msk,"mean"] =  _target.mean()
    _df.loc[_msk,"max"]  =  _target.max()

In [None]:
_df["max"] = MinMaxScaler().fit_transform(_df["max"].to_numpy().reshape(-1,1))
fig = px.scatter_mapbox(_df[~nan_mask], lat="_Lat", lon="_Lon", size="max", hover_name="Neighborhood", color="mean", zoom=3, height=500,color_continuous_scale=gradient)
fig.update_layout(mapbox_style='carto-positron')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
_df["_Lat"] = MinMaxScaler().fit_transform(_df["_Lat"].to_numpy().reshape(-1,1))
_df["_Lon"] = MinMaxScaler().fit_transform(_df["_Lon"].to_numpy().reshape(-1,1))

In [None]:
px.box(_df, color="Neighborhood", y="target", height=400, color_discrete_sequence=palette)

In [None]:
_df = _df.drop(["mean","max"],axis=1)
_df["_ExpenciveNh"] = _df["Neighborhood"].isin(["NoRidge","NridgHt","StoneBr"]).astype(int)
_df = replace_dummies("Neighborhood",_df)

## <a class="anchor" style="color:#8f6e60"> Conditions features</a>
---

In [None]:
# The default OneHotEncoder output is already sparse; the `sparse` keyword is
# left out because newer scikit-learn releases no longer accept it
encoder = OneHotEncoder()
cond = encoder.fit_transform(_df["Condition1"].to_numpy().reshape(-1,1)).toarray()
cond += encoder.transform(_df["Condition2"].to_numpy().reshape(-1,1)).toarray()
# A condition counts if it appears in either Condition1 or Condition2
cond = cond.astype(bool).astype(int)
cond = pd.DataFrame(data=cond,columns=encoder.categories_[0], index=_df.index)

In [None]:
_df = _df.drop(["Condition1","Condition2"], axis=1)
_df = pd.concat([_df,cond], axis=1)
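Here is the Condition1/Condition2 merge on a four-row toy frame, as a sketch of the technique. The rows are invented; the encoder is fitted on the first column and reused for the second, so the second column may only contain categories the first one has.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows
toy = pd.DataFrame({"Condition1": ["Norm", "Feedr", "Artery", "Norm"],
                    "Condition2": ["Norm", "Norm", "Feedr", "Artery"]})

encoder = OneHotEncoder()   # sparse output by default, hence .toarray()
hot = encoder.fit_transform(toy[["Condition1"]].to_numpy()).toarray()
hot += encoder.transform(toy[["Condition2"]].to_numpy()).toarray()

# 1 if the condition appears in either column, 0 otherwise
multi_hot = pd.DataFrame((hot > 0).astype(int),
                         columns=encoder.categories_[0], index=toy.index)
print(multi_hot)
```

Each row ends up with one shared set of condition flags instead of two separate one-hot blocks, which halves the number of columns.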

In [None]:
save_to_df("environment", _df)
_df.head(1)

## <a class="anchor" id="1.3_bullet" style="color:#b14404"> 1.3 Selling terms features </a>
---------------------

In [None]:
attr_list,_df = attribute_slice("sellterms")

In [None]:
_df =replace_dummies("SaleCondition",_df)

## <a class="anchor" style="color:#8f6e60"> YrSold and MoSold feature </a>
---

In [None]:
# Only the target is needed for the plot, so sum just that column
ym_df = _df.groupby(["YrSold","MoSold"],as_index=False)["target"].sum()
ym_df["&"] = ym_df["YrSold"].astype(str)+"."+ym_df["MoSold"].astype(str)
fig = px.line(ym_df, x="&", y="target", color_discrete_sequence=palette)
fig.add_hline(y=ym_df["target"].mean(), line_dash="dash", line_color=palette[1])

In [None]:
# 1 for the high-demand months (April-August), 0 for the rest of the year
_df["_HighDemand"] = _df["MoSold"].isin([4,5,6,7,8]).astype(int)

In [None]:
_df["YrSold"] = _df["YrSold"].astype(str)
_df =replace_dummies("YrSold",_df)

## <a class="anchor" style="color:#8f6e60"> SaleType feature </a>
---

For this feature we will just use a normality test on the distributions and merge the values that influence the target equally.

In [None]:
_df["SaleType"] = _df["SaleType"].fillna("Oth")
_df["SaleType"] = _df["SaleType"].astype(str)

# Warranty deed
_df["_Deed"] = _df["SaleType"].isin(["WD","CWD","VWD","COD"]).astype(int)
#Contract
_df["_Contract"] = _df["SaleType"].isin(["Con","ConLw","ConLI","ConLD"]).astype(int)

_df =replace_dummies("SaleType",_df)

In [None]:
save_to_df("sellterms", _df)
_df.head(1)

## <a class="anchor" id="1.4_bullet" style="color:#b14404"> 1.4 Land plot features </a>
---

In [None]:
attr_list,_df = attribute_slice("landplot")

In [None]:
# We will scale Lot Area, to check the distribution
_df["LotArea"] = QuantileTransformer().fit_transform(_df["LotArea"].values.reshape(-1,1))
px.scatter(_df, x="LotArea", y="target", color="target", color_continuous_scale=gradient, height=400)

In [None]:
for k in ["LotShape","LandContour","LotConfig","LandSlope"]:
    _df = replace_dummies(k,_df)

In [None]:
save_to_df("landplot", _df)
_df.head(1)

## <a class="anchor" id="1.5_bullet" style="color:#b14404"> 1.5 Dwelling features </a>
---

In [None]:
attr_list,_df = attribute_slice("dwelling")

In [None]:
_df["_WasRemod"] = (_df["YearBuilt"] != _df["YearRemodAdd"]).astype(int)

In [None]:
# I merge the Exterior features, hopefully with only a small loss of information
e = 3
_df["_Exterior"] = _df["Exterior1st"] + "_" + _df["Exterior2nd"]
_df.loc[:,["_Exterior"]] = _df.loc[:,["_Exterior"]].fillna("extra")
_df = _df.drop(["Exterior1st","Exterior2nd"],axis=1)
_a = _df["_Exterior"].value_counts()
# Collapse combinations that occur at most `e` times into a single "extra" category
rare = _a[_a<=e].index
_df.loc[_df["_Exterior"].isin(rare),"_Exterior"] = "extra"
px.box(_df[~nan_mask], color="_Exterior", y="target", color_discrete_sequence=gradient, height=400)
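The rare-category merge can be checked on a tiny example. This is a sketch with invented counts; `threshold` plays the role of `e` above.

```python
import pandas as pd

# Hypothetical combined exterior labels: two frequent ones and two one-offs
exterior = pd.Series(["VinylSd_VinylSd"] * 5 + ["HdBoard_HdBoard"] * 4
                     + ["Stone_Stucco", "BrkFace_Wd Sdng"])

threshold = 3
counts = exterior.value_counts()
rare = counts[counts <= threshold].index
# Keep frequent labels, replace every rare one with "extra"
collapsed = exterior.where(~exterior.isin(rare), "extra")

print(collapsed.value_counts().to_dict())
```

Without this step every one-off combination would get its own dummy column holding a single 1, which mostly adds noise.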

In [None]:
_df.loc[:,["MasVnrType"]] = _df.loc[:,["MasVnrType"]].fillna("None")
_df["_HasVnr"] = (_df["MasVnrType"]!="None").astype(int)

_df.loc[:,["MasVnrArea"]] = _df.loc[:,["MasVnrArea"]].fillna(0)
_df["MasVnrArea"] = MinMaxScaler().fit_transform(_df["MasVnrArea"].values.reshape(-1,1))

_df.loc[:,["BsmtFullBath"]] = _df.loc[:,["BsmtFullBath"]].fillna(0)
_df.loc[:,["BsmtHalfBath"]] = _df.loc[:,["BsmtHalfBath"]].fillna(0)

_df["_Bath"] = 0.5 * (_df["BsmtHalfBath"] + _df["HalfBath"]) + _df["BsmtFullBath"] + _df["FullBath"]

In [None]:
for f in ["BldgType","HouseStyle","_Exterior","Foundation","RoofStyle","MasVnrType","MSSubClass","RoofMatl"]:
    _df = replace_dummies(f,_df)

## <a class="anchor" style="color:#8f6e60"> General dwelling type feature </a>
---

In [None]:
tsne = TSNE(n_components=2, perplexity=50, early_exaggeration=5,random_state=0)
X = tsne.fit_transform(_df.drop(["target"],axis=1))
X = pd.DataFrame(data=X, columns=["DwellTSNE1","DwellTSNE2"], index=_df.index)

In [None]:
# NOTE: despite the variable name, this is DBSCAN (eps=4), not a Gaussian mixture
gmm = DBSCAN(eps=4)
clusters = pd.Series(gmm.fit_predict(X), index=_df.index)[~nan_mask]+1

In [None]:
fig = px.scatter(x=X.loc[~nan_mask,"DwellTSNE1"], 
                 y=X.loc[~nan_mask,"DwellTSNE2"], 
                 color=clusters, size_max=7,
                 color_continuous_scale=gradient)
fig.show()
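For reference, the t-SNE + DBSCAN combination looks like this on synthetic data. It is a sketch under assumptions: the two Gaussian blobs, `perplexity=30` and `min_samples=5` are example choices, and the number of clusters found depends on how `eps` relates to the scale of the embedding.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two hypothetical, well-separated groups of 10-dimensional points
data = np.vstack([rng.normal(0, 1, (75, 10)), rng.normal(8, 1, (75, 10))])

# Project to 2D first, then cluster in the embedded space
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(data)
labels = DBSCAN(eps=4, min_samples=5).fit_predict(embedding)   # -1 marks noise

n_clusters = len(set(labels) - {-1})
print(f"clusters found: {n_clusters}, noise points: {(labels == -1).sum()}")
```

DBSCAN labels noise points with -1, which is why the notebook adds 1 to the labels before plotting: noise becomes 0 and the real clusters start at 1.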

In [None]:
_df

In [None]:
save_to_df("dwelling", _df)
_df.head(1)

## <a class="anchor" id="1.6_bullet" style="color:#b14404"> 1.6 Benefits features </a>
---

In [None]:
attr_list,_df = attribute_slice("benefits")

In [None]:
_df["_GasHeating"] = _df["Heating"].isin(["GasA","GasW"]).astype(int)
px.box(_df[~nan_mask], color="_GasHeating", y="target", color_discrete_sequence=palette, height=400)

In [None]:
_df["CentralAir"] = (_df["CentralAir"]=="Y").astype(int)
_df["Electrical"] = _df["Electrical"].fillna("NA")
_df["Utilities"] = (_df["Utilities"]!="AllPub").astype(int)
_df["Functional"] = _df["Functional"].fillna("Typ")

In [None]:
for k in ["Functional","Electrical","Heating"]:
    _df =replace_dummies(k,_df)

In [None]:
save_to_df("benefits", _df)
_df.head(1)

## <a class="anchor" id="1.7_bullet" style="color:#b14404"> 1.7 Square feet features </a>

In [None]:
attr_list,_df = attribute_slice("sqrfeets")

In [None]:
_df["_PorchSF"] = _df[["WoodDeckSF","OpenPorchSF","EnclosedPorch","3SsnPorch","ScreenPorch"]].sum(axis=1)
_df["_PorchSF"] = MinMaxScaler().fit_transform(_df["_PorchSF"].values.reshape(-1,1))
_df["_NoPorch"] = (_df["_PorchSF"]==0).astype(int)
px.scatter(_df,x="_PorchSF",y="target",color="target",color_continuous_scale=gradient)

In [None]:
_df["_NoPool"] = (_df["PoolArea"]==0).astype(int)
_df["_1to2floorSF"] = _df["2ndFlrSF"]/_df["1stFlrSF"]
_df["_No2floor"] = (_df["_1to2floorSF"]==0).astype(int)

In [None]:
_df["1stFlrSF"] = MinMaxScaler().fit_transform(_df["1stFlrSF"].values.reshape(-1,1))
_df["2ndFlrSF"] = MinMaxScaler().fit_transform(_df["2ndFlrSF"].values.reshape(-1,1))
_df["GrLivArea"] = MinMaxScaler().fit_transform(_df["GrLivArea"].values.reshape(-1,1))
_df["OpenPorchSF"] = MinMaxScaler().fit_transform(_df["OpenPorchSF"].values.reshape(-1,1))

In [None]:
save_to_df("sqrfeets", _df)
_df.head(1)

## <a class="anchor" id="1.8_bullet" style="color:#b14404"> 1.8 Basement features </a>

In [None]:
attr_list,_df = attribute_slice("basement")

In [None]:
_df.loc[_df["BsmtFinSF1"].isna(),["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF","TotalBsmtSF"]] = 0

In [None]:
_df['_Nobsmt'] = (_df["TotalBsmtSF"]==0).astype(int)

In [None]:
_df["BsmtUnfSF"] = MinMaxScaler().fit_transform(_df["BsmtUnfSF"].values.reshape(-1,1))
_df["TotalBsmtSF"] = MinMaxScaler().fit_transform(_df["TotalBsmtSF"].values.reshape(-1,1))

## <a class="anchor" style="color:#8f6e60"> Basement summary feature </a>
---

In [None]:
_model = LinearRegression()
_model.fit(_df[~nan_mask][attr_list],_df[~nan_mask]["target"])
_df["_BasementSummary"] = QuantileTransformer().fit_transform(_model.predict(_df[attr_list]).reshape(-1,1))

In [None]:
pltdf = _df[~nan_mask].copy()
fig = px.scatter(pltdf, x="_BasementSummary", y="target",color="target",height=400, color_continuous_scale=gradient)
fig.show()

In [None]:
save_to_df("basement", _df)
_df.head(1)

## <a class="anchor" id="1.9_bullet" style="color:#b14404"> 1.9 Garage features </a>
---

In [None]:
attr_list,_df = attribute_slice("garage")

In [None]:
_df["GarageType"] = _df["GarageType"].fillna("NA")

_msk =_df["GarageYrBlt"].isna()
_df.loc[_msk,"GarageYrBlt"] = df.loc[_msk,"YearBuilt"]
_df.loc[_msk,"GarageFinish"] = 'Fin'
_df.loc[_df["GarageCars"].isna(),"GarageCars"] = _df.mode()["GarageCars"].values[0]
_df.loc[_df["GarageArea"].isna(),"GarageArea"] = _df.mode()["GarageArea"].values[0]

_df["_NoGarage"] = (_df["GarageArea"] == 0).astype(int)

In [None]:
px.box(_df[~nan_mask], color="_NoGarage", y="target", color_discrete_sequence=palette, height=400)

In [None]:
_df["GarageArea"] = QuantileTransformer().fit_transform(_df["GarageArea"].to_numpy().reshape(-1,1))
px.scatter(_df[~nan_mask], x="GarageArea", y="target", color="target", color_continuous_scale=gradient, height=400)

In [None]:
for f in ["GarageType","GarageFinish"]:
    _df = replace_dummies(f,_df)

In [None]:
save_to_df("garage", _df)
_df.head(1)

## <a class="anchor" id="1.10_bullet" style="color:#b14404"> 1.10 Other features </a>
---

In [None]:
attr_list,_df = attribute_slice("other")

In [None]:
plt_df = _df.copy()
plt_df["MiscFeature"] = plt_df["MiscFeature"].astype(str)
px.box(plt_df[~nan_mask], color="MiscFeature", y="target", color_discrete_sequence=palette, height=400)

In [None]:
_df.loc[_df["MiscFeature"] == "TenC","MiscFeature"] = np.nan

In [None]:
_df= replace_dummies("MiscFeature",_df)

In [None]:
save_to_df("other", _df)
_df.head(1)

## <a class="anchor" id="1.11_bullet" style="color:#b14404"> 1.11 Clusterization </a>
---

## <a class="anchor" style="color:#8f6e60"> Clusters </a>
---

In [None]:
_df = df.copy(deep=True)
tsne = TSNE(n_components=2, perplexity=50, early_exaggeration=12,random_state=0)
X = tsne.fit_transform(_df.drop(["SalePrice","SalePrice_transformed"],axis=1))
X = pd.DataFrame(data=X, columns=["DwellTSNE1","DwellTSNE2"], index=_df.index)

In [None]:
gmm = DBSCAN(eps=7, min_samples=10)
_df = df.copy(deep=True)
color = pd.Series(gmm.fit_predict(X), index=_df.index)[~nan_mask]+1

In [None]:
fig = px.scatter(x=X.loc[~nan_mask,"DwellTSNE1"], 
                 y=X.loc[~nan_mask,"DwellTSNE2"], 
                 color= color, 
                 size_max=7,
                 color_continuous_scale=gradient)
fig.show()

In [None]:
df["_cluster"] = gmm.fit_predict(X)

# <a class="anchor" id="2_bullet" style="color:#2f4c28"> 2. Modeling </a>
## <a class="anchor" id="2.1_bullet" style="color:#b14404"> 2.1 Data Preprocessing </a>
---

In [None]:
print(f"Number of attributes: {len(df.columns)}")

In [None]:
# Pruning of the features with fewer than 30 non-zero values
for c in df.columns:
    if (len(df) - df[df[c]==0][c].count()) < 30:
        df = df.drop(c,axis=1)
print(f"Number of attributes (Pruned): {len(df.columns)}")
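The pruning rule is easy to verify on a toy frame. A minimal sketch, with made-up columns and a threshold of 2 instead of 30:

```python
import pandas as pd

# Hypothetical toy features: "rare_flag" is non-zero in a single row only
toy = pd.DataFrame({"common":    [1, 0, 1, 1, 0, 1],
                    "rare_flag": [0, 0, 0, 1, 0, 0],
                    "area":      [10.0, 12.0, 0.0, 9.0, 11.0, 8.0]})

min_nonzero = 2
nonzero_counts = (toy != 0).sum()              # non-zero values per column
kept = toy.loc[:, nonzero_counts >= min_nonzero]

print(kept.columns.tolist())   # ['common', 'area']
```

As in the loop above, NaN values count as non-zero here, so a column that is mostly NaN is not removed by this rule.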

In [None]:
# Work on a copy, so that adding the "outlier" column does not write into a view of df
X = df.loc[~df["SalePrice_transformed"].isna(),:].copy()
X["outlier"] = IsolationForest(random_state=0, n_estimators=1000).fit_predict(X)*-1+2
df = df.drop(X[X["outlier"]!=1].index)
px.scatter(X, y="SalePrice_transformed", color="outlier", size="outlier",color_continuous_scale=gradient,size_max=7)
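The `*-1+2` part of the cell above is just a relabeling of the IsolationForest output. Here is a small sketch on synthetic points; the planted row at (20, 20) and `n_estimators=200` are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
points = rng.normal(0, 1, (200, 2))
points = np.vstack([points, [[20.0, 20.0]]])   # one planted, obviously extreme row

# fit_predict returns 1 for inliers and -1 for outliers
raw = IsolationForest(n_estimators=200, random_state=0).fit_predict(points)
# Same relabeling as in the notebook: inlier -> 1, outlier -> 3
flag = raw * -1 + 2

print(f"planted row flag: {flag[-1]}, rows flagged as outliers: {(flag == 3).sum()}")
```

Using 1 and 3 instead of 1 and -1 keeps the flag positive, so it can double as the marker size in the scatter plot.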

In [None]:
_df = df.drop("SalePrice",axis=1)
_df.index = _df.index+1
mask = _df["SalePrice_transformed"].isna()

In [None]:
train, deploy = _df.loc[~mask,:], _df.loc[mask,:]
models = {}

In [None]:
x_tr, y_tr = train.drop("SalePrice_transformed", axis=1), train["SalePrice_transformed"] 

## <a class="anchor" id="2.2_bullet" style="color:#b14404"> 2.2 Linear models </a>
## <a class="anchor" style="color:#8f6e60"> Linear Regression</a>
---

In [None]:
name = "lin"
models[name] = LinearRegression().fit(x_tr, y_tr)
print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv("lin.csv", index=False)
# "lin" : 0.12696
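Every model below repeats the same predict / inverse-transform / save steps. A hypothetical helper (the name `make_submission` and the synthetic data are my assumptions, not part of the notebook) could look like this sketch:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import QuantileTransformer

def make_submission(model, features, scaler):
    """Predict on the transformed scale, then map predictions back to prices."""
    pred = model.predict(features)
    prices = scaler.inverse_transform(pred.reshape(-1, 1)).squeeze()
    return pd.DataFrame({"Id": features.index, "SalePrice": prices})

rng = np.random.default_rng(0)
# Hypothetical data: 100 labeled rows and 20 rows to predict, with Ids 1..120
X = pd.DataFrame({"area": rng.uniform(500, 3000, 120)}, index=np.arange(1, 121))
y = 100 * X["area"] + rng.normal(0, 5_000, 120)
X_train, y_train, X_deploy = X.iloc[:100], y.iloc[:100], X.iloc[100:]

scaler = QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0)
y_train_t = scaler.fit_transform(y_train.to_numpy().reshape(-1, 1)).ravel()

model = LinearRegression().fit(X_train, y_train_t)
submission = make_submission(model, X_deploy, scaler)
print(submission.head(3))
```

Calling `submission.to_csv("lin.csv", index=False)` afterwards would produce the same kind of file as the cell above. One property worth knowing: the inverse quantile transform can never return a price outside the range seen in the training target.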

## <a class="anchor" style="color:#8f6e60"> LassoCV Regression</a>
---

In [None]:
name= "lassocv"

_grid = {"alphas":np.arange(0.0005,0.0015,0.0001)}
_cv =ShuffleSplit(n_splits=30, test_size=400, random_state=0)
models[name] = LassoCV(cv=_cv,**_grid)
models[name].fit(x_tr, y_tr)
print(f"alpha for the best model: {models[name].alpha_}")
print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "lassocv" : 0.12479 
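To show how the `alphas` grid and the `ShuffleSplit` validation work together, here is a sketch on synthetic data. The data, the grid and the split sizes are example assumptions, much smaller than the ones used above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Only the first two columns carry signal
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 300)

alphas = np.arange(0.005, 0.1, 0.005)
# 10 random train/validation splits with 60 validation rows each
cv = ShuffleSplit(n_splits=10, test_size=60, random_state=0)
model = LassoCV(alphas=alphas, cv=cv).fit(X, y)

print(f"chosen alpha: {model.alpha_:.3f}")
print(f"non-zero coefficients: {(model.coef_ != 0).sum()} of {X.shape[1]}")
```

`LassoCV` evaluates every alpha on every split, picks the alpha with the lowest mean validation error and then refits on the full training data. Keep in mind that the `score` printed in the notebook cells is the R² on the training data, not a validation score.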

## <a class="anchor" style="color:#8f6e60"> RidgeCV Regression</a>
---

In [None]:
name= "ridgecv"

_grid = {"alphas":np.arange(0.4,0.5,0.001)}
_cv =ShuffleSplit(n_splits=50, test_size=400, random_state=0)
models[name] = RidgeCV(cv =_cv, **_grid)
models[name].fit(x_tr, y_tr)
print(f"alpha for the best model: {models[name].alpha_}")
print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "ridgecv" : 0.12686

## <a class="anchor" style="color:#8f6e60"> ElasticNet </a>
---

In [None]:
name= "elasticcv"
_grid = {"alphas" : np.arange(0.0001,0.1,0.0001),
         "l1_ratio" : [.1, .5, .7, .9, .95, .96, .97,.98, .99, 1]}
_cv =ShuffleSplit(n_splits=30, test_size=400, random_state=0)
# max_iter must be an integer; recent scikit-learn versions reject the float 1e5
models[name] = ElasticNetCV(cv =_cv,max_iter=100000, **_grid)
models[name].fit(x_tr, y_tr)
print(f"alpha for the best model: {models[name].alpha_}")
print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "elasticcv" : 0.12479

## <a class="anchor" id="2.3_bullet" style="color:#b14404"> 2.3 Advanced models </a>
## <a class="anchor" style="color:#8f6e60"> GradientBoostingRegressor </a>
---

In [None]:
#name= "sk_gbr"
#_grid = {"subsample":[0.4,0.3],
#         "max_depth":[5],
#         "min_samples_split":[15,],
#         "min_samples_leaf":[5],
#         "learning_rate":np.arange(0.05,0.2,0.001)}

#model = GradientBoostingRegressor(verbose=False,random_state=0, loss='huber')
#_cv =ShuffleSplit(n_splits=30, test_size=400, random_state=0)
#clf = GridSearchCV(estimator=model, param_grid=_grid, cv=_cv, verbose=2)
#clf.fit(x_tr, y_tr) 
#clf.best_params_

In [None]:
name= "sk_gbr"
params = {'learning_rate': 0.05,'max_depth': 5,'min_samples_leaf': 5,'min_samples_split': 15,'subsample': 0.4}
models[name] = GradientBoostingRegressor(verbose=False,random_state=0, loss='huber', **params)
models[name].fit(x_tr, y_tr)
print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "sk_gbr" : 0.13215

## <a class="anchor" style="color:#8f6e60"> LightGBM </a>
---

In [None]:
#name= "lightgbm"
#_grid = {"learning_rate": [0.018999999999999996],
#         "n_estimators":[500,5000],
#         "min_split_gain":[0],
#         "min_child_samples":[20],
#         "subsample":[1],
#         "reg_alpha":[0.03],
#         "reg_lambda":[0.02]}

#model = LGBMRegressor(silent=True, random_state=0, max_depth=-1)
#_cv =ShuffleSplit(n_splits=30, test_size=400, random_state=0)
#clf = GridSearchCV(estimator=model, param_grid=_grid, cv=_cv, verbose=2)
#clf.fit(x_tr, y_tr)

In [None]:
name= "lightgbr"
params = {'learning_rate': 0.018999999999999996,'min_child_samples': 20,'min_split_gain': 0,
          'n_estimators': 500,'reg_alpha': 0.03,'reg_lambda': 0.02,'subsample': 1}
# verbose=-1 silences LightGBM; the `silent` keyword is not available in recent releases
models[name] = LGBMRegressor(verbose=-1, random_state=0, max_depth=-1, **params)
models[name].fit(x_tr, y_tr)

print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "lightgbr" : 0.12948

## <a class="anchor" style="color:#8f6e60"> XGBoost </a>
---

In [None]:
#_grid = {"max_depth":[6],
#         "learning_rate":[0.022985,0.02299,0.022995],
#         "subsample":[0.5],
#         "reg_alpha":[0.0011,0.001],
#         "reg_lambda":[0.405,0.41,0.415],
#         "n_estimators": [500]}

#model = XGBRegressor(random_state=0,**{'tree_method': 'gpu_hist', 'max_bin': 16, 'gpu_id': 0})
#_cv =ShuffleSplit(n_splits=30, test_size=400, random_state=0)
#clf = GridSearchCV(estimator=model, param_grid=_grid, cv=_cv, verbose=2)
#clf.fit(x_tr, y_tr)

In [None]:
name= "xgbm"
params = {'learning_rate': 0.02299,'max_depth': 6,'n_estimators': 500,'reg_alpha': 0.001,'reg_lambda': 0.41,'subsample': 0.5}
models[name] = XGBRegressor(random_state=0, **params)
models[name].fit(x_tr, y_tr)

print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "xgbm" : 0.12843

## <a class="anchor" style="color:#8f6e60"> CatBoostRegressor </a>
---

In [None]:
#name= "cat"
#_grid = {"l2_leaf_reg":(0.0038,0.004,0.0042),
#         "rsm":(0.05,0.1,0.25),
#         "learning_rate":np.arange(0.0108,0.011,0.00001)} 

#model = CatBoostRegressor(verbose=False, iterations=5000, depth=6)
#_cv =ShuffleSplit(n_splits=50, test_size=1400, random_state=0)
#model.grid_search(_grid, X=x_tr, y=y_tr, cv=_cv, plot=True, verbose=False)

In [None]:
name= "cat"
params = {"rsm":0.05, "depth":6, "learning_rate": 0.0109, "l2_leaf_reg": 0.004}
models[name] = CatBoostRegressor(verbose=False, iterations=5000, **params)
models[name].fit(x_tr, y_tr)

print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "cat" : 0.12245

## <a class="anchor" id="2.4_bullet" style="color:#b14404"> 2.4 Mixture models </a>
## <a class="anchor" style="color:#8f6e60"> Linear Mixture </a>
---

In [None]:
name = "lin_mix"
names = ["elasticcv","ridgecv","lassocv","lin"]
# Use a list (not a set) so the order of the base regressors is deterministic
models[name] = StackingCVRegressor(regressors=[models[k] for k in names],
                                   meta_regressor=CatBoostRegressor(verbose=False),
                                   use_features_in_secondary=True)
models[name].fit(x_tr, y_tr)

print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "lin_mix" : 0.11957
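
For readers without `mlxtend`, the same stacking idea can be sketched with scikit-learn's own `StackingRegressor`. This is a swapped-in alternative, not the code used in this notebook: `passthrough=True` plays a role similar to `use_features_in_secondary=True`, `RidgeCV` replaces CatBoost as the meta-model, and the data is synthetic (all of these are assumptions made for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for x_tr / y_tr.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

# A list of (name, estimator) pairs keeps the base-model order fixed.
base_models = [
    ("lin", LinearRegression()),
    ("ridgecv", RidgeCV()),
    ("lassocv", LassoCV(random_state=0)),
]

# The meta-model is trained on out-of-fold predictions of the base models;
# passthrough=True additionally feeds it the original features.
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=RidgeCV(),
    cv=5,
    passthrough=True,
)
stack.fit(X_train, y_train)

valid_score = stack.score(X_valid, y_valid)
print(f"held-out R^2: {valid_score:.3f}")
```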

## <a class="anchor" style="color:#8f6e60"> Advanced Mixture </a>
---

In [None]:
name = "adv_mix"
names = ["xgbm","lightgbr","sk_gbr","cat"]
# Use a list (not a set) so the order of the base regressors is deterministic
models[name] = StackingRegressor(regressors=[models[k] for k in names],
                                 meta_regressor=CatBoostRegressor(verbose=False),
                                 use_features_in_secondary=True)
models[name].fit(x_tr, y_tr)

print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "adv_mix" : 0.12678

## <a class="anchor" style="color:#8f6e60"> Mixture of Mixtures </a>
---

In [None]:
name = "mix_mix"
names = ["lin_mix","adv_mix"]
# Use a list (not a set) so the order of the base regressors is deterministic
models[name] = StackingRegressor(regressors=[models[k] for k in names],
                                 meta_regressor=CatBoostRegressor(verbose=False),
                                 use_features_in_secondary=True)
models[name].fit(x_tr, y_tr)

print(f"SCORE:{models[name].score(x_tr, y_tr)}")

d = models[name].predict(deploy.drop(["SalePrice_transformed"], axis=1))
d  = target_scaler.inverse_transform(d.reshape(-1,1)).squeeze()
d = pd.DataFrame({"Id":deploy.index,"SalePrice":d})
d.to_csv(f"{name}.csv", index=False)
# "mix_mix" : 0.12246

## <a class="anchor" id="2.5_bullet" style="color:#b14404"> 2.5 Averaging (may not be the best one)</a>
---

In [None]:
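# Scores of the individual submissions (lower is better)
dep_acc = {"lin" : 0.12696, "lassocv" : 0.12479, "ridgecv" : 0.12686, "elasticcv" : 0.12479, 
           "sk_gbr" : 0.13100, "lightgbr" : 0.12701, "xgbm" : 0.12985, "cat" : 0.12638,
           "lin_mix" : 0.11957, "adv_mix" : 0.12678, "mix_mix" : 0.12246}
# Turn the scores into blending weights: min-max scale 1 - score to [0, 1]
# (best model -> 1, worst model -> 0), then normalize so the weights sum to 1
scaled = MinMaxScaler(feature_range=(0,1)).fit_transform((1-np.array(list(dep_acc.values()))).reshape(-1,1))
scaled = scaled/sum(scaled)
# From here on dep_acc maps each model name to its blending weight
dep_acc = {k:scaled[i][0] for i,k in enumerate(dep_acc)}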
dep_acc
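
To make the weighting scheme easier to follow, here is a small worked example in plain NumPy. The three model names and their errors are hypothetical; only the arithmetic mirrors the cell above:

```python
import numpy as np

# Hypothetical leaderboard errors (lower is better) for three models.
errors = {"model_a": 0.120, "model_b": 0.125, "model_c": 0.130}

values = np.array(list(errors.values()))

# Min-max scale the "goodness" 1 - error to [0, 1]: best -> 1, worst -> 0.
goodness = 1 - values
scaled = (goodness - goodness.min()) / (goodness.max() - goodness.min())

# Normalize so that the weights sum to 1.
weights = dict(zip(errors, scaled / scaled.sum()))
print(weights)
```

The weights come out as roughly 2/3, 1/3 and 0. Note that with this scheme the worst model always gets a weight of exactly 0, so it does not contribute to the blend at all.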

In [None]:
# Use one submission as a template for the Id column,
# then accumulate the weighted predictions of every model
deploy = pd.read_csv("cat.csv")
deploy["SalePrice"] = 0.0
for k in dep_acc:
    _d = pd.read_csv(f"{k}.csv")
    deploy["SalePrice"] = deploy["SalePrice"] + _d["SalePrice"]*dep_acc[k]

In [None]:
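# Save the blended predictions accumulated in `deploy`
# (`d` still holds the "mix_mix" submission)
deploy.to_csv("final.csv", index=False)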
# "final" : 0.12226
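
The blending loop above adds the `SalePrice` columns row by row, which works as long as every CSV lists the houses in the same order. A slightly more defensive sketch aligns the frames on `Id` first. The two in-memory frames and the weights below are hypothetical stand-ins for the saved submissions and `dep_acc`:

```python
import pandas as pd

# Hypothetical per-model submissions; the second one has its rows in a different order.
preds = {
    "model_a": pd.DataFrame({"Id": [1461, 1462, 1463],
                             "SalePrice": [100000.0, 200000.0, 300000.0]}),
    "model_b": pd.DataFrame({"Id": [1463, 1461, 1462],
                             "SalePrice": [330000.0, 110000.0, 220000.0]}),
}
weights = {"model_a": 0.75, "model_b": 0.25}  # assumed to sum to 1

# Index every frame by Id so the addition aligns rows by house, not by position.
blend = sum(preds[k].set_index("Id")["SalePrice"] * weights[k] for k in preds)

# Back to the two-column submission layout.
final = blend.rename_axis("Id").rename("SalePrice").reset_index()
print(final)
```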