<a href="https://www.kaggle.com/code/sachinpatil1280/housing-prices-prediction-top-0-2?scriptVersionId=144394480" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Import Modules

In [None]:
# Basic
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Visualization
import seaborn as sns

# Preprocessing and encoding
import sklearn_pandas
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin, clone
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, Normalizer, StandardScaler, OneHotEncoder, RobustScaler

# Models
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, ElasticNetCV, Lasso, LassoCV, LassoLarsCV, LinearRegression
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb

# metrics
from sklearn.metrics import mean_squared_error,accuracy_score

# Warning
import warnings
warnings.filterwarnings('ignore')

# Import train and test data

In [None]:
train = pd.read_csv('/kaggle/input/home-data-for-ml-course/train.csv')
test =pd.read_csv('/kaggle/input/home-data-for-ml-course/test.csv')

In [None]:
data = pd.concat([train,test])
data.head()

In [None]:
print(train.shape,test.shape,data.shape)

# Exploratory Data Analysis and Data tidying

I noticed that 38 of the 81 columns in the dataset are numerical. Moreover, some columns contain missing data, e.g. "LotFrontage" and "MasVnrArea".

The prediction target is 'SalePrice'. Let's take a closer look at this column.

## Basic Summary

In [None]:
train.SalePrice.describe()

In [None]:
# Box_plot for SalePrice
plt.figure(figsize=(14,5))
sns.boxplot(data = train,x='SalePrice')
plt.tight_layout()

There are two outliers with prices more than 700000.

In [None]:
# The Density Plot of SalePrice
plt.figure(figsize=(14,5))
sns.set_style('darkgrid')
sns.histplot(data= train,x='SalePrice',bins=50,kde=True)
plt.title("Density plot of SalePrice Before Log Transformation")
plt.tight_layout()
plt.xticks(rotation=50)
plt.show()

I noticed that it is a right-skewed distribution with a peak around 160k and a fairly long tail with a maximum of about 800k.

In [None]:
# Positive skewness (right skew)
train.SalePrice.skew()

In [None]:
# To make SalePrice closer to a normal distribution, I apply a logarithmic transformation.
train['SalePrice'] = np.log1p(train['SalePrice'])

In [None]:
# Density plot for SalePrice after log transformation.
plt.figure(figsize=(14,5))
sns.histplot(train['SalePrice'],kde=True,bins= 50)
plt.title("Density plot of SalePrice after Log Transformation")
plt.tight_layout()
plt.show()
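
To see why the log transformation helps, here is a minimal sketch on synthetic right-skewed data. The lognormal sample and the variable names are assumptions made for illustration, not values from the competition data. It also shows that `np.expm1` inverts `np.log1p`, which matters later when predictions have to be turned back into prices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (illustrative only, not the Kaggle data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

log_prices = np.log1p(prices)    # log(1 + x), also safe for zeros
restored = np.expm1(log_prices)  # inverse of log1p, up to floating-point error

print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())
```

The skewness of the transformed series should be much closer to zero than that of the raw series.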

# Missing Values

In [None]:
# Number of missing values in each column count 
col = train.isna().sum()
col_na = pd.DataFrame({'Column': col.index,'Count':col.values}).sort_values(by='Count',ascending=False)
col_na.head(20)

In [None]:
# Visual representation of the top 20 columns with missing values
sns.set(font_scale=1.2)
plt.figure(figsize=(14,5))
sns.barplot(data=col_na.head(20),x='Column',y='Count')
plt.xticks(rotation=50)
plt.tight_layout()

In [None]:
# Percentage of missing values 
col_na['Percent_nan'] = (col_na['Count']/train.shape[0])*100
col_na
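
As a side note, the percentage of missing values can also be computed in one step, because the mean of a boolean mask is the fraction of `True` entries. The sketch below uses a small made-up frame (the values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (illustrative values)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# isna().mean() gives the fraction of missing values per column
percent_nan = (toy.isna().mean() * 100).sort_values(ascending=False)
print(percent_nan)
```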

In [None]:
# visual for percentage of missing values
plt.figure(figsize=(14,5))
sns.set(font_scale=1.2)
sns.barplot(data=col_na.head(20),x='Column',y='Percent_nan')
plt.xticks(rotation=50)
plt.tight_layout()

**Removing Id column**

In [None]:
train = train.drop(columns='Id')
test = test.drop(columns='Id')

**Removing near-constant columns**

According to the basic statistics provided on the Kaggle competition website, the columns Street and Utilities are dominated by a single value, "Pave" and "AllPub" respectively.

In [None]:
print(train['Street'].value_counts())
print(train['Utilities'].value_counts())

In [None]:
train = train.drop(columns=['Street','Utilities'])
test = test.drop(columns= ['Street','Utilities'])
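
Instead of reading the statistics by eye, columns dominated by one value could be detected programmatically. This is only a sketch: the toy frame, the `dominant_share` helper and the 0.95 threshold are hypothetical choices, not part of the original analysis.

```python
import pandas as pd

# Toy frame: two columns dominated by one value, one balanced column
toy = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],
    "Utilities": ["AllPub"] * 100,
    "LotShape": ["Reg"] * 60 + ["IR1"] * 40,
})

def dominant_share(column):
    """Fraction of rows taken by the most frequent value (hypothetical helper)."""
    return column.value_counts(normalize=True, dropna=False).iloc[0]

threshold = 0.95  # hypothetical cut-off
near_constant = [c for c in toy.columns if dominant_share(toy[c]) >= threshold]
print(near_constant)
```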

# Removing outliers

Removing outliers is an important step in data analysis. However, when removing outliers in ML we should be careful, because the test set may contain outliers as well.

I plotted SalePrice against GrLivArea and removed the points that seem to be outliers.

**GrLivArea Feature**

In [None]:
# plot for SalePrice vs GrLivArea
plt.figure(figsize=(14, 5))
sns.set(font_scale=1.2)
sns.scatterplot(data= data, y='SalePrice',x='GrLivArea')
plt.title("GrLivArea vs SalePrice")
plt.tight_layout()
plt.show()

In [None]:
# I decided to remove the training records where 'GrLivArea' is more than 4500. We can see on the plot that they have a very low price.
# Only training rows are dropped: every test house still needs a prediction.
train = train.drop(train[train['GrLivArea'] > 4500].index).reset_index(drop=True)

In [None]:
# Concatenate all data together - both train and test
all_data = pd.concat([train, test]).reset_index(drop=True)

**GarageYrBlt feature**

I checked whether any records have a GarageYrBlt year later than 2017.

In [None]:
all_data[all_data['GarageYrBlt']>2017]['GarageYrBlt']
# It seems to be a typo

In [None]:
# Change the typo to 2007 (selected by mask rather than by a hard-coded row label)
all_data.loc[all_data['GarageYrBlt'] > 2017, 'GarageYrBlt'] = 2007

**LotFrontage feature**

LotFrontage is the linear feet of street connected to the property. It is very likely that houses in the same Neighborhood have similar values, so I check some per-neighborhood statistics.

In [None]:
# plot for SalePrice vs LotFrontage
plt.figure(figsize=(14,5))
sns.scatterplot(data = all_data,x='LotFrontage',y='SalePrice')
plt.title("LotFrontage vs SalePrice")
plt.tight_layout()
plt.show()

In [None]:
all_data['Neighborhood']

In [None]:
nei_lot = all_data.groupby('Neighborhood')['LotFrontage'].agg(['mean','median'])
nei_lot['avg_mean_median'] = (nei_lot['mean']+nei_lot['median'])/2
nei_lot.sort_values(by='avg_mean_median', ascending=False).head()

In [None]:
# Fill missing LotFrontage values with the median of the same Neighborhood
all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
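
The group-wise fill can be checked on a tiny example. The frame below is made up for illustration; only the `groupby(...).transform(...)` pattern is taken from the cell above.

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods, each with one missing LotFrontage
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "StoneBr", "StoneBr"],
    "LotFrontage": [60.0, 80.0, np.nan, 40.0, np.nan],
})

# Each missing value is replaced by the median of its own neighborhood
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda x: x.fillna(x.median()))
)
print(toy)
```

The missing NAmes value becomes 70 (the median of 60 and 80), and the missing StoneBr value becomes 40.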

# Feature Engineering

**Transformation of some numerical variables that are actually categorical**

In [None]:
def convert_to_string(df, columns):
    df[columns] = df[columns].astype(str)
    return df


num_to_categ_features = ['MSSubClass', 'OverallCond']
all_data = convert_to_string(all_data, columns = num_to_categ_features)

**Replacing missing values in the rest of numerical columns**

For the other numerical features I will also estimate the missing values from their statistics, using the SimpleImputer object from the sklearn library. For the columns BsmtFinSF1, BsmtFinSF2, BsmtFullBath, BsmtHalfBath and MasVnrArea I will fill NaN values with the constant 0, and for the rest with the median.

In [None]:
# define 3 variables for replacing missing values
num_features = all_data.select_dtypes(include=['int64','float64']).columns
num_features_to_constant = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', "MasVnrArea"] 
num_features_to_median = [feature for feature in num_features if feature not in num_features_to_constant + ["SalePrice"]]

In [None]:
# Generating numerical features as input to DataFrameMapper.  
numeric_features_median = sklearn_pandas.gen_features(columns=[num_features_to_median], 
                                               classes=[{'class': SimpleImputer, 
                                                         'strategy': 'median', 
                                                         'missing_values' : np.nan}])

numeric_features_zero = sklearn_pandas.gen_features(columns=[num_features_to_constant], 
                                               classes=[{'class': SimpleImputer, 
                                                         'strategy': 'constant',
                                                         'fill_value' : 0, 
                                                         'missing_values' : np.nan}])

missing_val_imputer = sklearn_pandas.DataFrameMapper(numeric_features_median + numeric_features_zero)

In [None]:
# Fitting
imputed_median = missing_val_imputer.fit(all_data)

# Transformation
imputed_features = imputed_median.transform(all_data)

# Putting into dataframe
imputed_df = pd.DataFrame(imputed_features, index=all_data.index, columns=num_features_to_median + num_features_to_constant)

In [None]:
imputed_df.head()
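
The two imputation strategies can be illustrated without `sklearn_pandas` by calling `SimpleImputer` directly. This is a simplified sketch on a made-up frame, not the mapper used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "GarageArea": [400.0, np.nan, 600.0],
    "MasVnrArea": [np.nan, 100.0, np.nan],
})

median_imputer = SimpleImputer(strategy="median")
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)

# fit_transform returns a 2-D array, so it is assigned back to a 2-D column selection
toy[["GarageArea"]] = median_imputer.fit_transform(toy[["GarageArea"]])
toy[["MasVnrArea"]] = zero_imputer.fit_transform(toy[["MasVnrArea"]])
print(toy)
```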

**Categorical to numerical**

There are many categorical features in the data, so the next step is to transform them into numerical values.

In [None]:
# Selecting category features
cat_feats = all_data.select_dtypes(include=['object']).columns
cat_feats

According to data description (possible values for each feature) on Kaggle, I created a list of conversion values, specific for each column.

I implemented the none_transform function, which converts missing categorical values into the specific strings given in the none_conversion list.


In [None]:
none_conversion = [("MasVnrType","None"),
                  ("BsmtQual","NA"), 
                  ("Electrical", "SBrkr"),
                  ("BsmtCond","TA"),
                  ("BsmtExposure","No"),
                  ("BsmtFinType1","No"),
                  ("BsmtFinType2","No"),
                  ("CentralAir","N"),
                  ("Condition1","Norm"), 
                  ("Condition2","Norm"),
                  ("ExterCond","TA"),
                  ("ExterQual","TA"), 
                  ("FireplaceQu","NA"),
                  ("Functional","Typ"),
                  ("GarageType","No"), 
                  ("GarageFinish","No"), 
                  ("GarageQual","NA"), 
                  ("GarageCond","NA"), 
                  ("HeatingQC","TA"), 
                  ("KitchenQual","TA"),
                  ("MSZoning", "None"),
                  ("Exterior1st", "VinylSd"), 
                  ("Exterior2nd", "VinylSd"), 
                  ("SaleType", "WD")]


In [None]:
def none_transform(df, conversion_list):
    ''' Function that converts missing categorical values 
    into specific strings according to "conversion_list" 
    
    Returns the dataframe after transformation.
    '''
    for col, new_str in conversion_list:
        df.loc[:, col] = df.loc[:, col].fillna(new_str)
    return df


In [None]:
# Applying the "none_transform" function 
all_data = none_transform(all_data, none_conversion)
len(all_data.columns)

**Transformation of skewed features**

Since normally distributed data is preferable for linear models, I transform the skewed features to bring them closer to a normal distribution.

In [None]:
# collecting the numeric features without considering SalePrice
numeric_features = [feat for feat in num_features if feat not in ['SalePrice']] 

In [None]:
# selecting columns with skew more than 0.5
skewed_features = all_data[numeric_features].apply(lambda x: x.dropna().skew())
skewed_features = skewed_features[skewed_features > 0.5].index
print("\nHighly skewed features: \n\n{}".format(skewed_features.tolist()))

In [None]:
# Applying log-transformation 
all_data[skewed_features] = np.log1p(all_data[skewed_features])
test[skewed_features] = np.log1p(test[skewed_features])

In [None]:
"""A Box Cox transformation is a way to transform non-normal dependent variables into a normal shape.
The “optimal lambda” is the one that results in the best approximation of a normal distribution curve. 
I selected lambda= 0.15."""

#lambda_ = 0.15
#for feature in skewed_features:
    #all_data[feature] = boxcox1p(all_data[feature], lambda_)
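
For reference, the Box-Cox transform of `1 + x` mentioned in the commented-out cell can be written in a few lines of NumPy. The helper name `boxcox1p_sketch` is hypothetical; it follows the usual definition, which reduces to `log1p` as lambda approaches 0.

```python
import numpy as np

def boxcox1p_sketch(x, lam):
    """Box-Cox transform of 1 + x (hypothetical helper following the usual definition)."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

values = np.array([0.0, 10.0, 100.0, 1000.0])
transformed = boxcox1p_sketch(values, 0.15)  # lambda = 0.15, as mentioned above
print(transformed)
```

The transform keeps the ordering of the values while compressing the large ones, which is what reduces right skew.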

**Categorical into Numerical**

As some categorical features (e.g. KitchenQual, GarageQual) can be transformed into numerical values with a natural order, I also implemented a new encoder for them.


In [None]:
class OrderedLabelTransformer(BaseEstimator, TransformerMixin):
    orderDict = {"NA" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
    
    @staticmethod
    def get_dict(X):
        FirstDict = {"Po" : 0, "Fa" : 1, "TA" : 2, "Gd" : 3, "Ex" : 4}
        SecondDict = {"NA" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5}
        ThirdDict = {"NA" : 0, "Fa" : 1, "TA" : 2, "Gd" : 3, "Ex" : 4}
        for d in [FirstDict, SecondDict, ThirdDict]:
            if set(X) == set(d): 
                return d
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        def get_label(t):
            return self.orderDict[t]
        return np.array([get_label(n) for n in X])

In [None]:
class NeighborhoodTransformer(BaseEstimator, TransformerMixin):
    neighborhoodsmap = {'StoneBr' : 2, 'NridgHt' : 2, 'NoRidge': 2, 
                        'MeadowV' : 0, 'IDOTRR' : 0, 'BrDale' : 0 ,
                        'CollgCr': 1, 'Veenker' : 1, 'Crawfor' : 1,
                        'Mitchel' : 1, 'Somerst' : 1, 'NWAmes' : 1,
                        'OldTown' : 1, 'BrkSide' : 1, 'Sawyer' : 1, 
                        'NAmes' : 1, 'SawyerW' : 1, 'Edwards' : 1,
                        'Timber' : 1, 'Gilbert' : 1, 'ClearCr' : 1,
                        'NPkVill' : 1, 'Blmngtn' : 1, 'SWISU' : 1,
                        'Blueste': 1}

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        def get_label(t):
            return self.neighborhoodsmap[t]
        return np.array([get_label(n) for [n] in X])

In [None]:
# Generating features:
order_feats = ["ExterQual", "ExterCond", "HeatingQC", "KitchenQual", "BsmtQual","BsmtCond", "FireplaceQu", "GarageQual", "GarageCond"]

original_features_df = all_data[order_feats + ['Neighborhood']] # we need to save original values for one-hot encoding

order_features = sklearn_pandas.gen_features(order_feats, [OrderedLabelTransformer])
neighb_features = [(['Neighborhood'], [NeighborhoodTransformer()])]

In [None]:
# Pipeline
label_encoder = sklearn_pandas.DataFrameMapper(neighb_features + order_features)

In [None]:
# The list with order of column names
cols = ["Neighborhood"] + order_feats

# Transformation both train and test set
transformed_feats = label_encoder.fit_transform(all_data)

# Putting transformed features into dataframe
transformed_df = pd.DataFrame(transformed_feats, index=all_data.index, columns=cols)
original_features_df.shape

In [None]:
# feature without any transformation till now
rest_features = set(pd.concat([imputed_df, original_features_df],axis=1).columns).symmetric_difference(set(all_data.columns))
rest_features_df = all_data[list(rest_features)]
all_data = pd.concat([imputed_df, original_features_df, rest_features_df],axis=1)
all_data.shape


## Creating new features

These features seem useful for house price prediction. As they are not contained in the Kaggle dataset, I decided to create them from other information.

"TotalSqrtFeet" - Total living area (above-ground living area plus basement)

"TotalBaths" - Total number of bathrooms (a half bath counts as 0.5)

In [None]:
# Total square feet of the house
all_data["TotalSqrtFeet"] = all_data["GrLivArea"] + all_data["TotalBsmtSF"]
test["TotalSqrtFeet"] = test["GrLivArea"] + test["TotalBsmtSF"]

# Total number of bathrooms
all_data["TotalBaths"] = all_data["BsmtFullBath"] + (all_data["BsmtHalfBath"]  * .5) + all_data["FullBath"] + (all_data["HalfBath"]* .5)
test["TotalBaths"] = test["BsmtFullBath"] + (test["BsmtHalfBath"]  * .5) + test["FullBath"] + (test["HalfBath"]* .5)

In [None]:
# If the house has a garage
all_data['Isgarage'] = all_data['GarageArea'].apply(lambda x: 1 if x > 0 else 0)

# If the house has a fireplace
all_data['Isfireplace'] = all_data['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

# If the house has a pool
all_data['Ispool'] = all_data['PoolArea'].apply(lambda x: 1 if x > 0 else 0)

# If the house has second floor
all_data['Issecondfloor'] = all_data['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)

# If the house has Open Porch
all_data['IsOpenPorch'] = all_data['OpenPorchSF'].apply(lambda x: 1 if x > 0 else 0)

# If the house has Wood Deck
all_data['IsWoodDeck'] = all_data['WoodDeckSF'].apply(lambda x: 1 if x > 0 else 0)
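
The same 0/1 flags can be produced without `apply` by comparing the whole column at once. This is a sketch of an equivalent vectorized form on made-up values, not a change to the cell above.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({"PoolArea": [0, 512, 0], "Fireplaces": [1, 0, 2]})

# The comparison yields booleans; astype(int) turns them into 0/1 flags
toy["Ispool"] = (toy["PoolArea"] > 0).astype(int)
toy["Isfireplace"] = (toy["Fireplaces"] > 0).astype(int)
print(toy)
```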

## Prepare Data

In [None]:
y_train = train['SalePrice']

In [None]:
all_data = all_data.drop(["SalePrice"], axis = 1)

hot_one_features = pd.get_dummies(all_data).reset_index(drop=True)
hot_one_features.shape

In [None]:
all_data = pd.concat([transformed_df, hot_one_features],axis=1)

In [None]:
#Splitting into train/test
train_preprocessed = all_data.iloc[:len(train),:]
test_preprocessed = all_data.iloc[len(train_preprocessed):,:]
print(len(test_preprocessed) == len(test))

In [None]:
# Modeling
X_train = train_preprocessed
X_test = test_preprocessed

In [None]:
def rmse(model):
    """Mean 5-fold cross-validated RMSE of `model` on the training data."""
    n_folds = 5
    # Pass the KFold object itself so that shuffling and the seed are actually used
    kfold = KFold(n_folds, random_state=42, shuffle=True)
    rmse_score = np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv=kfold, verbose=0, n_jobs=-1))
    return np.mean(rmse_score)
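
One detail worth knowing when cross-validating: `get_n_splits` returns only the number of folds as an integer, so passing its result as `cv` loses the shuffle and seed settings, whereas passing the `KFold` object keeps them. The sketch below demonstrates this on synthetic data; the `Ridge` model and the generated arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
n_splits = kfold.get_n_splits(X)  # just the integer 5, no shuffling information

# Passing the KFold object keeps the shuffled, seeded splits
scores = cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error", cv=kfold)
rmse_per_fold = np.sqrt(-scores)
print(rmse_per_fold.mean())
```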

In [None]:
X_train.shape,y_train.shape,X_test.shape

# Models

# Linear Regression

In [None]:
lr_model = make_pipeline(RobustScaler(), LinearRegression())
lr_model.fit(X_train,y_train)
y_train_pred = lr_model.predict(X_train)
mse_train = round(mean_squared_error(y_train_pred,y_train),5)
rmse_lr = round(rmse(lr_model),5)
print('MSE for Linear Regression is :',mse_train)
print('RMSE for Linear Regression is :',rmse_lr)

In [None]:
# Prediction Plot
plt.figure(figsize=(14,6))
plt.scatter(y_train, y_train_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs. Predicted Prices")
plt.show()

In [None]:
# Residual plot
plt.figure(figsize=(14,5))
plt.scatter(y_train_pred,y_train_pred - y_train)
plt.title("Residual Plot")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.tight_layout()


# LASSO Model

In [None]:
ls_model = make_pipeline(RobustScaler(),LassoCV(alphas=[0.0005],random_state=0,cv=10))
ls_model.fit(X_train,y_train)
y_train_pred = ls_model.predict(X_train)
mse_train = round(mean_squared_error(y_train_pred,y_train),5)
rmse_lasso = round(rmse(ls_model),5)
print('MSE for Lasso is :',mse_train)
print('RMSE for Lasso is :',rmse_lasso)

In [None]:
# Prediction Plot
plt.figure(figsize=(14,6))
plt.scatter(y_train, y_train_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs. Predicted Prices")
plt.show()

In [None]:
# Residual plot
plt.figure(figsize=(14,5))
plt.scatter(y_train_pred,y_train_pred - y_train)
plt.title("Residual Plot")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.tight_layout()

# GradientBoostingRegressor

In [None]:
gbr = GradientBoostingRegressor(random_state=0)
param_grid = {'n_estimators': [3400],
              'max_features': [13],
              'max_depth': [5],
              'learning_rate': [0.01],
              'subsample': [0.8],
             'random_state' : [5]}
gb_model = GridSearchCV(estimator=gbr, param_grid=param_grid, n_jobs=1, cv=5)
gb_model.fit(X_train, y_train)
#gb_model.best_params_

In [None]:
y_train_pred = gb_model.predict(X_train)
mse_train = round(mean_squared_error(y_train_pred,y_train),5)
rmse_gb = round(rmse(gb_model),5)
print('MSE for Gradient Boosting is :',mse_train)
print('RMSE for Gradient Boosting is :',rmse_gb)

In [None]:
# Prediction Plot
plt.figure(figsize=(14,6))
plt.scatter(y_train, y_train_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs. Predicted Prices")
plt.show()

In [None]:
# Residual plot
plt.figure(figsize=(14,5))
plt.scatter(y_train_pred,y_train_pred - y_train)
plt.title("Residual Plot")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.tight_layout()

# XGB Regressor

In [None]:
xgbreg = xgb.XGBRegressor(seed=0)
param_grid2 = {'n_estimators': [2500], 
              'learning_rate': [0.03],
              'max_depth': [3],
              'subsample': [0.8],
              'colsample_bytree': [0.45]}
    
xgb_model = GridSearchCV(estimator=xgbreg, param_grid=param_grid2, n_jobs=1, cv=10)
xgb_model.fit(X_train, y_train)

In [None]:
y_train_pred = xgb_model.predict(X_train)
mse_train = round(mean_squared_error(y_train_pred,y_train),5)
rmse_xgb = round(rmse(xgb_model),5)
print('MSE for XGB Regressor is :',mse_train)
print('RMSE for XGB Regressor is :',rmse_xgb)

In [None]:
# Prediction Plot
plt.figure(figsize=(14,6))
plt.scatter(y_train, y_train_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs. Predicted Prices")
plt.show()

In [None]:
# Residual plot
plt.figure(figsize=(14,5))
plt.scatter(y_train_pred,y_train_pred - y_train)
plt.title("Residual Plot")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.tight_layout()

# ElasticNet

In [None]:
en_model = ElasticNetCV(alphas = [0.0005], 
                        l1_ratio = [.9], 
                        random_state = 0,
                        cv=10)
en_model.fit(X_train,y_train)

In [None]:
y_train_pred = en_model.predict(X_train)
mse_train = round(mean_squared_error(y_train_pred,y_train),5)
rmse_en = round(rmse(en_model),5)
print('MSE for ElasticNet is :',mse_train)
print('RMSE for ElasticNet is :',rmse_en)

In [None]:
# Prediction Plot
plt.figure(figsize=(14,6))
plt.scatter(y_train, y_train_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs. Predicted Prices")
plt.show()

In [None]:
# Residual plot
plt.figure(figsize=(14,5))
plt.scatter(y_train_pred,y_train_pred - y_train)
plt.title("Residual Plot")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.tight_layout()

# LightGBM

In [None]:
lgb_model = lgb.LGBMRegressor(objective='regression', num_leaves=5,
                              learning_rate=0.05, n_estimators=4000,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
lgb_model.fit(X_train,y_train)

In [None]:
y_train_pred = lgb_model.predict(X_train)
mse_train = round(mean_squared_error(y_train_pred,y_train),5)
# Cross-validated RMSE, so that it is comparable with the other models
rmse_lgb = round(rmse(lgb_model),5)
print('MSE for LightGBM is :',mse_train)
print('RMSE for LightGBM is :',rmse_lgb)

In [None]:
# Prediction Plot
plt.figure(figsize=(14,6))
plt.scatter(y_train, y_train_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs. Predicted Prices")
plt.show()

In [None]:
# Residual plot
plt.figure(figsize=(14,5))
plt.scatter(y_train_pred,y_train_pred - y_train)
plt.title("Residual Plot")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.tight_layout()

# Stacking

In [None]:
from mlxtend.regressor import StackingCVRegressor

In [None]:
lasso_model = make_pipeline(RobustScaler(), 
                            LassoCV(max_iter= 10000000, alphas = [0.0005],random_state = 42, cv=5))

elasticnet_model = make_pipeline(RobustScaler(),
                                 ElasticNetCV(max_iter=10000000, alphas=[0.0005], cv=5, l1_ratio=0.9))

lgbm_model = make_pipeline(RobustScaler(),
                           lgb.LGBMRegressor(objective='regression',num_leaves=5,
                                             learning_rate=0.05, n_estimators=4000,
                                             max_bin = 55, bagging_fraction = 0.8,
                                             bagging_freq = 5, feature_fraction = 0.23,
                                             feature_fraction_seed = 9, bagging_seed=9,
                                             min_data_in_leaf = 6, 
                                             min_sum_hessian_in_leaf = 11))

xgboost_model = make_pipeline(RobustScaler(),
                              xgb.XGBRegressor(learning_rate = 0.01, n_estimators=3400,
                                               max_depth=3,min_child_weight=0 ,
                                               gamma=0, subsample=0.7,colsample_bytree=0.7,
                                               objective= 'reg:squarederror',nthread=4,
                                               scale_pos_weight=1,seed=27, reg_alpha=0.00006))

stack_regressor = StackingCVRegressor(regressors=(lasso_model, elasticnet_model, xgboost_model, lgbm_model),
                                      meta_regressor=xgboost_model, use_features_in_secondary=True)

In [None]:
stack_model = stack_regressor.fit(np.array(X_train),  np.array(y_train))

In [None]:
stack_gen_pred = stack_model.predict(np.array(X_test))  # the stack was fitted on NumPy arrays
lgbm_pred = lgb_model.predict(X_test)
lasso_pred = ls_model.predict(X_test)
en_pred = en_model.predict(X_test)
xgb_pred = xgb_model.predict(X_test)
gb_pred = gb_model.predict(X_test)

In [None]:
error = pd.DataFrame({'Models': ['Linear Regression','Lasso Model','Gradient Boosting Regressor','XGB Regressor','Elastic Net','Light GBM'],
                     'RMSE':[rmse_lr,rmse_lasso,rmse_gb,rmse_xgb,rmse_en,rmse_lgb]})
error.sort_values(by='RMSE',ascending =False)

# Weighted predictions

In [None]:
stack_preds = ((0.1*xgb_pred) + (0.075*gb_pred) + (0.4*lgbm_pred) + (0.4*stack_gen_pred) +(0.025*en_pred) ) 

In [None]:
stack_preds

In [None]:
final_pred = np.expm1(stack_preds)
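
The blending step is a weighted average on the log scale followed by `np.expm1` to return to prices. A minimal sketch with made-up predictions follows; the model names, values and weights are hypothetical. It also makes the requirement explicit that the weights sum to 1.

```python
import numpy as np

# Hypothetical log-scale predictions from three models for two houses
preds = {
    "model_a": np.array([12.0, 11.5]),
    "model_b": np.array([12.2, 11.7]),
    "model_c": np.array([11.8, 11.6]),
}
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}

# The weights should sum to 1 so that the blend stays on the same scale
assert np.isclose(sum(weights.values()), 1.0)

blended_log = sum(weights[name] * preds[name] for name in preds)
blended_price = np.expm1(blended_log)  # undo the log1p applied to SalePrice
print(blended_price)
```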

# Submission

In [None]:
sub = pd.read_csv('/kaggle/input/home-data-for-ml-course/sample_submission.csv')

In [None]:
sub['SalePrice'] = final_pred
sub.head()

In [None]:
sub.to_csv('submission.csv',index= False)