<a href="https://colab.research.google.com/github/Hbvsa/HousePricesKaggle/blob/main/HousePrices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

In [2]:
def concat_df(train_data, test_data):
    # Returns a concatenated df of training and test set
    return pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

def divide_df(all_data):
    # Returns divided dfs of training and test set
    return all_data.loc[:1459], all_data.loc[1460:].drop(['SalePrice'], axis=1)

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_all = concat_df(df_train, df_test)

df_train.name = 'Training Set'
df_test.name = 'Test Set'
df_all.name = 'All Set'

dfs = [df_train, df_test]

print('Number of Training Examples = {}'.format(df_train.shape[0]))
print('Number of Test Examples = {}\n'.format(df_test.shape[0]))
print('Training X Shape = {}'.format(df_train.shape))
print('Training y Shape = {}\n'.format(df_train['SalePrice'].shape[0]))
print('Test X Shape = {}'.format(df_test.shape))
print('Test y Shape = {}\n'.format(df_test.shape[0]))
print(df_train.columns)
print(df_test.columns)

Number of Training Examples = 1460
Number of Test Examples = 1459

Training X Shape = (1460, 81)
Training y Shape = 1460

Test X Shape = (1459, 80)
Test y Shape = 1459

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 

Many of the missing values are not really missing: they indicate that the house does not have the feature at all (no alley, no fence, no garage), so they should be filled with a category of their own. Deciding this requires reading the data documentation. Doing so shows that every object-dtype column with a large number of missing values (more than 50) can be filled with 'NA' as a category.
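As a minimal sketch of the same idea on toy data, the fill can also be driven by an explicit column list instead of a count threshold. The toy frame and the `ABSENT_MEANS_NA` name are illustrative assumptions, not part of this notebook; the list would have to be taken from the data documentation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_all (values are made up for illustration).
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan],
    "Fence": [np.nan, "MnPrv", np.nan],
    "Electrical": ["SBrkr", np.nan, "SBrkr"],  # here NaN is genuinely unknown
})

# Hypothetical hand-picked list: columns where a missing entry means "no such feature".
ABSENT_MEANS_NA = ["Alley", "Fence"]

toy[ABSENT_MEANS_NA] = toy[ABSENT_MEANS_NA].fillna("NA")
print(toy.isnull().sum())
```

An explicit list leaves columns such as Electrical untouched, where a missing entry is a real gap rather than an absent feature.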

In [3]:
# Set the maximum number of rows to be displayed
pd.set_option('display.max_rows', None)

# Set the maximum width of the columns to be displayed
pd.set_option('display.width', None)

In [4]:
df_all.isnull().sum()

1stFlrSF            0
2ndFlrSF            0
3SsnPorch           0
Alley            2721
BedroomAbvGr        0
BldgType            0
BsmtCond           82
BsmtExposure       82
BsmtFinSF1          1
BsmtFinSF2          1
BsmtFinType1       79
BsmtFinType2       80
BsmtFullBath        2
BsmtHalfBath        2
BsmtQual           81
BsmtUnfSF           1
CentralAir          0
Condition1          0
Condition2          0
Electrical          1
EnclosedPorch       0
ExterCond           0
ExterQual           0
Exterior1st         1
Exterior2nd         1
Fence            2348
FireplaceQu      1420
Fireplaces          0
Foundation          0
FullBath            0
Functional          2
GarageArea          1
GarageCars          1
GarageCond        159
GarageFinish      159
GarageQual        159
GarageType        157
GarageYrBlt       159
GrLivArea           0
HalfBath            0
Heating             0
HeatingQC           0
HouseStyle          0
Id                  0
KitchenAbvGr        0
KitchenQua

# Fixing missing values of object features

In [5]:
for col in df_all.columns:
    if df_all[col].dtype == 'object' and df_all[col].isnull().sum() > 50:
        print(f"Column: {col}")
        print(f"Number of missing values before filling: {df_all[col].isnull().sum()}")
        df_all[col] = df_all[col].fillna('NA')
        print(f"Number of missing values after filling: {df_all[col].isnull().sum()}")
        print("\n")

Column: Alley
Number of missing values before filling: 2721
Number of missing values after filling: 0


Column: BsmtCond
Number of missing values before filling: 82
Number of missing values after filling: 0


Column: BsmtExposure
Number of missing values before filling: 82
Number of missing values after filling: 0


Column: BsmtFinType1
Number of missing values before filling: 79
Number of missing values after filling: 0


Column: BsmtFinType2
Number of missing values before filling: 80
Number of missing values after filling: 0


Column: BsmtQual
Number of missing values before filling: 81
Number of missing values after filling: 0


Column: Fence
Number of missing values before filling: 2348
Number of missing values after filling: 0


Column: FireplaceQu
Number of missing values before filling: 1420
Number of missing values after filling: 0


Column: GarageCond
Number of missing values before filling: 159
Number of missing values after filling: 0


Column: GarageFinish
Number of missin

Now let's deal with the columns that still have many missing values but hold continuous values.

In [6]:
missing_values = df_all.isnull().sum()
missing_values = missing_values[missing_values != 0]
print(missing_values)

BsmtFinSF1         1
BsmtFinSF2         1
BsmtFullBath       2
BsmtHalfBath       2
BsmtUnfSF          1
Electrical         1
Exterior1st        1
Exterior2nd        1
Functional         2
GarageArea         1
GarageCars         1
GarageYrBlt      159
KitchenQual        1
LotFrontage      486
MSZoning           4
MasVnrArea        23
SalePrice       1459
SaleType           1
TotalBsmtSF        1
Utilities          2
dtype: int64


GarageYrBlt: if the garage columns say there is no garage (GarageCond is 'NA'), we can fill the year with 0. If a garage exists but its year is missing, we fall back to YearBuilt.

In [7]:
# Fill missing values in the GarageYrBlt column:
# 0 where there is no garage, YearBuilt where a garage exists but its year is missing
cond1 = df_all['GarageCond'] == 'NA'
cond2 = df_all['GarageYrBlt'].isnull() & (df_all['GarageCond'] != 'NA')
df_all.loc[cond1, 'GarageYrBlt'] = 0
df_all.loc[cond2, 'GarageYrBlt'] = df_all.loc[cond2, 'YearBuilt']
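A quick sanity check of this two-step rule on toy data may help; the three rows below are made up, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GarageCond": ["TA", "NA", "TA"],
    "GarageYrBlt": [1999.0, np.nan, np.nan],
    "YearBuilt": [1999, 1950, 1978],
})

no_garage = toy["GarageCond"] == "NA"
unknown_year = toy["GarageYrBlt"].isnull() & ~no_garage

toy.loc[no_garage, "GarageYrBlt"] = 0                                      # no garage at all
toy.loc[unknown_year, "GarageYrBlt"] = toy.loc[unknown_year, "YearBuilt"]  # garage, year unknown
print(toy)
```

Note that 0 is a sentinel far outside the range of real years, which can distort a linear model that treats GarageYrBlt as a continuous feature.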

In [8]:
missing_values = df_all.isnull().sum()
missing_values = missing_values[missing_values != 0]
print(missing_values)

BsmtFinSF1         1
BsmtFinSF2         1
BsmtFullBath       2
BsmtHalfBath       2
BsmtUnfSF          1
Electrical         1
Exterior1st        1
Exterior2nd        1
Functional         2
GarageArea         1
GarageCars         1
KitchenQual        1
LotFrontage      486
MSZoning           4
MasVnrArea        23
SalePrice       1459
SaleType           1
TotalBsmtSF        1
Utilities          2
dtype: int64


# Now for LotFrontage
Every property should be connected to a street in some way, even if the house itself is hidden inside a large lot, so LotFrontage cannot be zero. We can measure its correlation with the target variable to see whether it is important at all.




In [9]:
# Assuming df_all is your dataset
# Select the column you want to compute the correlation for
column_name = 'LotFrontage'

# Compute the correlation between the column and the target variable
correlation = df_all[column_name].corr(df_all['SalePrice'])

# Print the correlation to the console
print(f"The correlation between {column_name} and SalePrice is {correlation:.4f}")

The correlation between LotFrontage and SalePrice is 0.3518


It is important enough to keep (a correlation of about 0.35 with SalePrice), so we are going to check whether other variables are highly correlated with it.
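One detail worth keeping in mind: SalePrice is missing for every test row of the combined frame, and `Series.corr` silently skips pairs with a missing value, so the coefficient above is effectively computed on rows where both values are present. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 6.0, np.nan, np.nan])  # last two targets unknown

r_all = x.corr(y)            # pandas drops the incomplete pairs itself
r_known = x[:3].corr(y[:3])  # same rows selected by hand
print(r_all, r_known)
```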

In [10]:
df_all_corr = df_all.drop(['SalePrice'], axis=1).corr(numeric_only=True).abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_all_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
df_all_corr.drop(df_all_corr.iloc[1::2].index, inplace=True)
df_all_corr_nd = df_all_corr.drop(df_all_corr[df_all_corr['Correlation Coefficient'] == 1.0].index)
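The cell above removes mirrored pairs by dropping every second row after sorting, which relies on the two copies of each pair ending up next to each other. A sketch of an alternative that does not depend on sort order, shown on a toy frame (the data is made up; the output column names follow the cell above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 5.9, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = toy.corr().abs()
# Keep only entries strictly above the diagonal: each unordered pair once, no self-pairs.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = (
    upper.stack()
         .dropna()                      # discard the masked cells
         .sort_values(ascending=False)
         .rename_axis(["Feature 1", "Feature 2"])
         .reset_index(name="Correlation Coefficient")
)
print(pairs)
```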

In [11]:
import matplotlib.pyplot as plt
import seaborn as sns

# Compute the correlation matrix of the numeric features
corr_matrix = df_all.corr(numeric_only=True)

# Create a heatmap of the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, cmap='coolwarm', fmt='.2f', linewidths=0.1)
plt.title('Correlation matrix of features')
plt.show()

In [12]:
# Assuming df_all is your dataset
# Compute the correlations of all variables with LotArea
corr_with_LotArea = df_all.corrwith(df_all['LotArea'],numeric_only=True)

# Print the correlations to the console
print(corr_with_LotArea)

1stFlrSF         0.332460
2ndFlrSF         0.031515
3SsnPorch        0.015995
BedroomAbvGr     0.132801
BsmtFinSF1       0.194031
BsmtFinSF2       0.084059
BsmtFullBath     0.128349
BsmtHalfBath     0.026292
BsmtUnfSF        0.021362
EnclosedPorch    0.020974
Fireplaces       0.261185
FullBath         0.125826
GarageArea       0.213251
GarageCars       0.180434
GarageYrBlt      0.073762
GrLivArea        0.284519
HalfBath         0.034244
Id              -0.040746
KitchenAbvGr    -0.020854
LotArea          1.000000
LotFrontage      0.489896
LowQualFinSF     0.000554
MSSubClass      -0.201730
MasVnrArea       0.125596
MiscVal          0.069029
MoSold           0.004156
OpenPorchSF      0.104797
OverallCond     -0.035617
OverallQual      0.100541
PoolArea         0.093708
SalePrice        0.263843
ScreenPorch      0.054375
TotRmsAbvGrd     0.213802
TotalBsmtSF      0.254138
WoodDeckSF       0.158045
YearBuilt        0.024128
YearRemodAdd     0.021612
YrSold          -0.024234
dtype: float

LotArea has a reasonable correlation with LotFrontage (about 0.49), so we will use it to fill in the missing LotFrontage values.
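To put that coefficient in perspective: for a regression on a single feature, the in-sample R² equals the squared Pearson correlation, so a coefficient near 0.49 means LotArea explains roughly a quarter of the variance in LotFrontage. The sketch below checks the identity on synthetic data (not the housing data) and then applies it to the reported value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)   # synthetic data, not the housing data

r = np.corrcoef(x, y)[0, 1]
r2 = LinearRegression().fit(x.reshape(-1, 1), y).score(x.reshape(-1, 1), y)
print(r ** 2, r2)                    # the two values agree

# Applied to the coefficient reported above for LotFrontage vs LotArea:
print(0.489896 ** 2)
```

The imputed values are therefore rough estimates, which is acceptable for filling gaps but worth remembering when interpreting LotFrontage later.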

In [13]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assuming df_all is your dataset
# Create a copy of the dataset
df = df_all.copy()

# Drop missing values in LotFrontage
df_no_missing = df.dropna(subset=['LotFrontage'])

# Fit a linear regression model to predict LotFrontage based on LotArea
X = df_no_missing[['LotArea']]
y = df_no_missing['LotFrontage']
model = LinearRegression()
model.fit(X, y)

# Use the model to predict missing values in LotFrontage
df.loc[df['LotFrontage'].isnull(), 'LotFrontage'] = model.predict(df[df['LotFrontage'].isnull()][['LotArea']])

In [14]:
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values != 0]
print(missing_values)

BsmtFinSF1         1
BsmtFinSF2         1
BsmtFullBath       2
BsmtHalfBath       2
BsmtUnfSF          1
Electrical         1
Exterior1st        1
Exterior2nd        1
Functional         2
GarageArea         1
GarageCars         1
KitchenQual        1
MSZoning           4
MasVnrArea        23
SalePrice       1459
SaleType           1
TotalBsmtSF        1
Utilities          2
dtype: int64


Before fixing any more missing values, let us transform the categorical features into integer labels so that we can check the correlation of every feature with the target variable.


In [15]:
# Column order is taken from df_train, but the values are read from df
# (the combined, partially imputed frame)
for column in df_train.columns:
    if df[column].dtype == 'object':
        print(f"Categorical {column}:")
        print(df[column].unique()[:10])
        print("...")


Categorical MSZoning:
['RL' 'RM' 'C (all)' 'FV' 'RH' nan]
...
Categorical Street:
['Pave' 'Grvl']
...
Categorical Alley:
['NA' 'Grvl' 'Pave']
...
Categorical LotShape:
['Reg' 'IR1' 'IR2' 'IR3']
...
Categorical LandContour:
['Lvl' 'Bnk' 'Low' 'HLS']
...
Categorical Utilities:
['AllPub' 'NoSeWa' nan]
...
Categorical LotConfig:
['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
...
Categorical LandSlope:
['Gtl' 'Mod' 'Sev']
...
Categorical Neighborhood:
['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer']
...
Categorical Condition1:
['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
...
Categorical Condition2:
['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
...
Categorical BldgType:
['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
...
Categorical HouseStyle:
['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin']
...
Categorical RoofStyle:
['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed']
...
Categorical RoofM

In [16]:
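# Split the numeric columns of df_train into continuous and ordinal by their maximum value.
# np.max returns nan for columns with missing values, and nan > 20 is False,
# so such columns (LotFrontage, MasVnrArea) end up in the ordinal branch.
for column in df_train.columns:
    if df_train[column].dtype != 'object' and column not in ['Id','YearBuilt','YearRemodAdd','YrSold']:
        unique_values = df_train[column].unique()
        if np.max(unique_values) > 20: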
            print(f"Continuous {column}:")
            print(df_train[column].unique()[:10])
            print("...")
        else:
            print(f"Ordinal{column}:")
            print(df_train[column].unique()[:10])

Continuous MSSubClass:
[ 60  20  70  50 190  45  90 120  30  85]
...
OrdinalLotFrontage:
[65. 80. 68. 60. 84. 85. 75. nan 51. 50.]
Continuous LotArea:
[ 8450  9600 11250  9550 14260 14115 10084 10382  6120  7420]
...
OrdinalOverallQual:
[ 7  6  8  5  9  4 10  3  1  2]
OrdinalOverallCond:
[5 8 6 7 4 2 3 9 1]
OrdinalMasVnrArea:
[196.   0. 162. 350. 186. 240. 286. 306. 212. 180.]
Continuous BsmtFinSF1:
[ 706  978  486  216  655  732 1369  859    0  851]
...
Continuous BsmtFinSF2:
[  0  32 668 486  93 491 506 712 362  41]
...
Continuous BsmtUnfSF:
[150 284 434 540 490  64 317 216 952 140]
...
Continuous TotalBsmtSF:
[ 856 1262  920  756 1145  796 1686 1107  952  991]
...
Continuous 1stFlrSF:
[ 856 1262  920  961 1145  796 1694 1107 1022 1077]
...
Continuous 2ndFlrSF:
[ 854    0  866  756 1053  566  983  752 1142 1218]
...
Continuous LowQualFinSF:
[  0 360 513 234 528 572 144 392 371 390]
...
Continuous GrLivArea:
[1710 1262 1786 1717 2198 1362 1694 2090 1774 1077]
...
OrdinalBsmtFullBath:


In [17]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
cat_columns = []
for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = LabelEncoder().fit_transform(df[column])
        # Create a new dataframe with one-hot encoded columns
        #one_hot_encoded = pd.get_dummies(df[column], prefix=column, prefix_sep='_')
        # Drop the original column and add the one-hot encoded columns to df_train
        #df = df.drop(column, axis=1)
        #df = df.join(one_hot_encoded)
        cat_columns.append(column)
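A caveat when reading correlations of label-encoded columns: `LabelEncoder` assigns codes in sorted (alphabetical) label order, which generally does not match a meaningful ranking. The sketch below contrasts it with a hand-written ranking for a quality scale of the Ex/Gd/TA/Fa/Po kind; the `rank` mapping is an illustrative assumption, not something this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(["Ex", "Gd", "TA", "Fa", "Po"])

# LabelEncoder sorts the labels, so the codes follow alphabetical order.
alphabetical = LabelEncoder().fit_transform(quality)

# Hypothetical hand-written ranking, from worst to best.
rank = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
ordered = quality.map(rank)

print(list(alphabetical))
print(list(ordered))
```

With alphabetical codes, a linear correlation with SalePrice is only a rough signal, so a low value does not prove that a categorical feature is uninformative.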

In [18]:
corr_with_HousePrices = df.corrwith(df['SalePrice']).abs()

# Print the correlations to the console
print(corr_with_HousePrices)
print(len(corr_with_HousePrices))

1stFlrSF         0.605852
2ndFlrSF         0.319334
3SsnPorch        0.044584
Alley            0.083121
BedroomAbvGr     0.168213
BldgType         0.085591
BsmtCond         0.091503
BsmtExposure     0.294589
BsmtFinSF1       0.386420
BsmtFinSF2       0.011378
BsmtFinType1     0.098734
BsmtFinType2     0.072717
BsmtFullBath     0.227122
BsmtHalfBath     0.016844
BsmtQual         0.593734
BsmtUnfSF        0.214479
CentralAir       0.251328
Condition1       0.091155
Condition2       0.007513
Electrical       0.234716
EnclosedPorch    0.128578
ExterCond        0.117303
ExterQual        0.636884
Exterior1st      0.103551
Exterior2nd      0.103766
Fence            0.140640
FireplaceQu      0.097176
Fireplaces       0.466929
Foundation       0.382479
FullBath         0.560664
Functional       0.115328
GarageArea       0.623431
GarageCars       0.640409
GarageCond       0.246705
GarageFinish     0.425684
GarageQual       0.205963
GarageType       0.415283
GarageYrBlt      0.261366
GrLivArea   

In [19]:
# Select the columns that have a correlation above 0.1 with SalePrice
correlated_columns = list(corr_with_HousePrices[corr_with_HousePrices > 0.1].index)

df = df[correlated_columns + ['Id']]

In [20]:
df.isnull().sum()

1stFlrSF            0
2ndFlrSF            0
BedroomAbvGr        0
BsmtExposure        0
BsmtFinSF1          1
BsmtFullBath        2
BsmtQual            0
BsmtUnfSF           1
CentralAir          0
Electrical          0
EnclosedPorch       0
ExterCond           0
ExterQual           0
Exterior1st         0
Exterior2nd         0
Fence               0
Fireplaces          0
Foundation          0
FullBath            0
Functional          0
GarageArea          1
GarageCars          1
GarageCond          0
GarageFinish        0
GarageQual          0
GarageType          0
GarageYrBlt         0
GrLivArea           0
HalfBath            0
HeatingQC           0
HouseStyle          0
KitchenAbvGr        0
KitchenQual         0
LotArea             0
LotFrontage         0
LotShape            0
MSZoning            0
MasVnrArea         23
Neighborhood        0
OpenPorchSF         0
OverallQual         0
PavedDrive          0
PoolQC              0
RoofMatl            0
RoofStyle           0
SaleCondit

# Fixing the last missing values

In [21]:
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values != 0]
print(missing_values)

BsmtFinSF1         1
BsmtFullBath       2
BsmtUnfSF          1
GarageArea         1
GarageCars         1
MasVnrArea        23
SalePrice       1459
TotalBsmtSF        1
dtype: int64


In [22]:
# Identify columns with missing values
missing_columns = list(df.columns[df.isnull().any()])
missing_columns.remove('SalePrice')

# Fill the missing values in each column
for col in missing_columns:
    print(col)
    max_value = df[col].max()
    if max_value > 20:
        # Continuous column: use the median to fill the missing values
        median = df[col].median()
        df = df.fillna({col: median})
    else:
        # Count-like column: use the mode to fill the missing values
        mode = df[col].mode().iloc[0]
        print("mode",mode)
        df = df.fillna({col: mode})

BsmtFinSF1
BsmtFullBath
mode 0.0
BsmtUnfSF
GarageArea
GarageCars
mode 2.0
MasVnrArea
TotalBsmtSF

In [23]:
cat_columns

['Alley',
 'BldgType',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'BsmtQual',
 'CentralAir',
 'Condition1',
 'Condition2',
 'Electrical',
 'ExterCond',
 'ExterQual',
 'Exterior1st',
 'Exterior2nd',
 'Fence',
 'FireplaceQu',
 'Foundation',
 'Functional',
 'GarageCond',
 'GarageFinish',
 'GarageQual',
 'GarageType',
 'Heating',
 'HeatingQC',
 'HouseStyle',
 'KitchenQual',
 'LandContour',
 'LandSlope',
 'LotConfig',
 'LotShape',
 'MSZoning',
 'MasVnrType',
 'MiscFeature',
 'Neighborhood',
 'PavedDrive',
 'PoolQC',
 'RoofMatl',
 'RoofStyle',
 'SaleCondition',
 'SaleType',
 'Street',
 'Utilities']

In [24]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler

for column in df:
    if column in cat_columns:
        # Create a new dataframe with one-hot encoded columns
        one_hot_encoded = pd.get_dummies(df[column], prefix=column, prefix_sep='_')
        # Drop the original column and add the one-hot encoded columns to df
        df = df.drop(column, axis=1)
        df = df.join(one_hot_encoded)
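For reference, `pd.get_dummies` can also encode several columns of a frame in one call through its `columns` argument, which avoids the drop-and-join loop. A sketch on a toy frame (values made up; `cat_cols` is a stand-in for the list of categorical columns):

```python
import pandas as pd

toy = pd.DataFrame({
    "Street": [0, 1, 0],               # label-encoded categorical
    "GrLivArea": [1710, 1262, 1786],   # numeric column, left untouched
})
cat_cols = ["Street"]

encoded = pd.get_dummies(toy, columns=cat_cols, prefix_sep="_")
print(list(encoded.columns))
```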

In [25]:
df.columns

Index(['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtFinSF1', 'BsmtFullBath',
       'BsmtUnfSF', 'EnclosedPorch', 'Fireplaces', 'FullBath', 'GarageArea',
       ...
       'RoofStyle_2', 'RoofStyle_3', 'RoofStyle_4', 'RoofStyle_5',
       'SaleCondition_0', 'SaleCondition_1', 'SaleCondition_2',
       'SaleCondition_3', 'SaleCondition_4', 'SaleCondition_5'],
      dtype='object', length=210)

Model fit

In [26]:
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values != 0]
print(missing_values)

SalePrice    1459
dtype: int64


In [27]:
df_train,df_test = divide_df(df)

In [28]:
df_train.columns

Index(['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtFinSF1', 'BsmtFullBath',
       'BsmtUnfSF', 'EnclosedPorch', 'Fireplaces', 'FullBath', 'GarageArea',
       ...
       'RoofStyle_2', 'RoofStyle_3', 'RoofStyle_4', 'RoofStyle_5',
       'SaleCondition_0', 'SaleCondition_1', 'SaleCondition_2',
       'SaleCondition_3', 'SaleCondition_4', 'SaleCondition_5'],
      dtype='object', length=210)

In [34]:
X = StandardScaler().fit_transform(df_train.drop(columns=['SalePrice','Id']))
y = np.log1p(df_train['SalePrice'].values)
X_sub= StandardScaler().fit_transform(df_test.drop(columns=['Id']))

print('X_train shape: {}'.format(X.shape))
print('y_train shape: {}'.format(y.shape))
print('X_test shape: {}'.format(X_sub.shape))

X_train shape: (1460, 208)
y_train shape: (1460,)
X_test shape: (1459, 208)
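The cell above fits one `StandardScaler` on the training features and a second, independent one on the test features, so the two matrices are not on exactly the same scale. A common alternative, sketched here on made-up arrays, is to fit the scaler on the training data only and reuse it for the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0], [20.0], [30.0]])   # deliberately on a different scale

scaler = StandardScaler().fit(train)        # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)        # reuses the train mean and std

separately = StandardScaler().fit_transform(test)
print(test_scaled.ravel())
print(separately.ravel())
```

With a shared scaler the test values stay far from zero, reflecting that they really are larger than the training values; scaling them separately hides that difference from the model.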


# Get blending models from another notebook's example

In [35]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from datetime import datetime
from scipy.stats import skew  # for some statistics
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from mlxtend.regressor import StackingCVRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import matplotlib.pyplot as plt
import scipy.stats as stats
import sklearn.linear_model as linear_model
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler



In [36]:
print('START ML', datetime.now(), )

kfolds = KFold(n_splits=10, shuffle=True, random_state=42)


# RMSLE: y is already log1p(SalePrice), so the RMSE of the log values is the RMSLE of the prices
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))


# build our model scoring function
def cv_rmse(model, X=X):
    rmse = np.sqrt(-cross_val_score(model, X, y,
                                    scoring="neg_mean_squared_error",
                                    cv=kfolds))
    return (rmse)


# setup models
alphas_alt = [14.5, 14.6, 14.7, 14.8, 14.9, 15, 15.1, 15.2, 15.3, 15.4, 15.5]
alphas2 = [5e-05, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008]
e_alphas = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007]
e_l1ratio = [0.8, 0.85, 0.9, 0.95, 0.99, 1]

ridge = make_pipeline(RobustScaler(),
                      RidgeCV(alphas=alphas_alt, cv=kfolds))

lasso = make_pipeline(RobustScaler(),
                      LassoCV(max_iter=int(1e7), alphas=alphas2,
                              random_state=42, cv=kfolds))

elasticnet = make_pipeline(RobustScaler(),
                           ElasticNetCV(max_iter=int(1e7), alphas=e_alphas,
                                        cv=kfolds, l1_ratio=e_l1ratio))

svr = make_pipeline(RobustScaler(),
                      SVR(C= 20, epsilon= 0.008, gamma=0.0003,))


gbr = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state =42)


lightgbm = LGBMRegressor(objective='regression',
                                       num_leaves=4,
                                       learning_rate=0.01,
                                       n_estimators=5000,
                                       max_bin=200,
                                       bagging_fraction=0.75,
                                       bagging_freq=5,
                                       bagging_seed=7,
                                       feature_fraction=0.2,
                                       feature_fraction_seed=7,
                                       verbose=-1,
                                       #min_data_in_leaf=2,
                                       #min_sum_hessian_in_leaf=11
                                       )


xgboost = XGBRegressor(learning_rate=0.01, n_estimators=3460,
                                     max_depth=3, min_child_weight=0,
                                     gamma=0, subsample=0.7,
                                     colsample_bytree=0.7,
                                     objective='reg:squarederror', nthread=-1,
                                     scale_pos_weight=1, seed=27,
                                     reg_alpha=0.00006)

# stack
stack_gen = StackingCVRegressor(regressors=(ridge, lasso, elasticnet,
                                            gbr, xgboost, lightgbm),
                                meta_regressor=xgboost,
                                use_features_in_secondary=True)


print('TEST score on CV')

score = cv_rmse(ridge)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()), datetime.now(), )

score = cv_rmse(lasso)
print("Lasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()), datetime.now(), )

score = cv_rmse(elasticnet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()), datetime.now(), )

score = cv_rmse(svr)
print("SVR score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()), datetime.now(), )

score = cv_rmse(lightgbm)
print("Lightgbm score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()), datetime.now(), )

score = cv_rmse(gbr)
print("GradientBoosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()), datetime.now(), )

score = cv_rmse(xgboost)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()), datetime.now(), )


print('START Fit')
print(datetime.now(), 'StackingCVRegressor')
stack_gen_model = stack_gen.fit(np.array(X), np.array(y))
print(datetime.now(), 'elasticnet')
elastic_model_full_data = elasticnet.fit(X, y)
print(datetime.now(), 'lasso')
lasso_model_full_data = lasso.fit(X, y)
print(datetime.now(), 'ridge')
ridge_model_full_data = ridge.fit(X, y)
print(datetime.now(), 'svr')
svr_model_full_data = svr.fit(X, y)
print(datetime.now(), 'GradientBoosting')
gbr_model_full_data = gbr.fit(X, y)
print(datetime.now(), 'xgboost')
xgb_model_full_data = xgboost.fit(X, y)
print(datetime.now(), 'lightgbm')
lgb_model_full_data = lightgbm.fit(X, y)

START ML 2024-04-17 21:49:44.991169
TEST score on CV
Kernel Ridge score: 0.1453 (0.0435)
 2024-04-17 21:50:17.876187
Lasso score: 0.1414 (0.0449)
 2024-04-17 21:50:44.028651
ElasticNet score: 0.1418 (0.0449)
 2024-04-17 21:51:54.156720
SVR score: 0.1378 (0.0248)
 2024-04-17 21:51:57.464114
Lightgbm score: 0.1277 (0.0174)
 2024-04-17 21:52:17.022319
GradientBoosting score: 0.1291 (0.0208)
 2024-04-17 21:53:59.845198




Xgboost score: 0.1235 (0.0191)
 2024-04-17 21:54:35.155634
START Fit
2024-04-17 21:54:35.155634 StackingCVRegressor




2024-04-17 21:57:18.624428 elasticnet
2024-04-17 21:57:24.734367 lasso
2024-04-17 21:57:27.013150 ridge
2024-04-17 21:57:30.081137 svr
2024-04-17 21:57:30.423341 GradientBoosting
2024-04-17 21:57:40.236327 xgboost




2024-04-17 21:57:43.448480 lightgbm
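The `rmsle` helper in the cell above is a plain RMSE; it acts as an RMSLE only because the target was transformed with `np.log1p` beforehand. A small check of that equivalence on made-up prices, using scikit-learn's `mean_squared_log_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

prices = np.array([100000.0, 200000.0, 150000.0])
preds = np.array([110000.0, 190000.0, 160000.0])

rmse_of_logs = np.sqrt(mean_squared_error(np.log1p(prices), np.log1p(preds)))
rmsle_direct = np.sqrt(mean_squared_log_error(prices, preds))
print(rmse_of_logs, rmsle_direct)
```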


In [37]:
def blend_models_predict(X):
    return ((0.1 * elastic_model_full_data.predict(X)) + \
            (0.1 * lasso_model_full_data.predict(X)) + \
            (0.1 * ridge_model_full_data.predict(X)) + \
            (0.1 * svr_model_full_data.predict(X)) + \
            (0.1 * gbr_model_full_data.predict(X)) + \
            (0.15 * xgb_model_full_data.predict(X)) + \
            (0.1 * lgb_model_full_data.predict(X)) + \
            (0.25 * stack_gen_model.predict(np.array(X))))

print('RMSLE score on train data:')
print(rmsle(y, blend_models_predict(X)))

RMSLE score on train data:
0.06725039717610752
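The blend is a weighted average, so the weights should sum to 1; otherwise the predicted log prices would be scaled systematically. The sketch below restates the weights from the cell above as a dictionary and checks this. The `blend` helper and the constant fake predictions are illustrative assumptions.

```python
import numpy as np

weights = {
    "elasticnet": 0.10, "lasso": 0.10, "ridge": 0.10, "svr": 0.10,
    "gbr": 0.10, "xgboost": 0.15, "lightgbm": 0.10, "stack": 0.25,
}
assert np.isclose(sum(weights.values()), 1.0)

def blend(predictions, weights):
    """Weighted sum of per-model prediction arrays keyed by model name."""
    return sum(w * np.asarray(predictions[name]) for name, w in weights.items())

# Made-up predictions: every model outputs the same constant.
fake = {name: np.full(3, 12.0) for name in weights}
print(blend(fake, weights))
```

Also note that the score above is measured on the same rows the models were fitted on, so it is expected to be much lower than the cross-validated scores.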


In [40]:
print('Predict submission', datetime.now(),)
submission = pd.read_csv("sample_submission.csv")
submission.iloc[:,1] = np.floor(np.expm1(blend_models_predict(X_sub)))

# this kernel gave a score 0.114
# let's up it by mixing with the top kernels

print('Blend with Top Kernals submissions', datetime.now(),)
sub_1 = pd.read_csv('House_Prices_submit.csv')
sub_2 = pd.read_csv('hybrid_solution.csv')
sub_3 = pd.read_csv('lasso_sol22_Median.csv')

submission.iloc[:,1] = np.floor((0.25 * np.floor(np.expm1(blend_models_predict(X_sub)))) +
                                (0.25 * sub_1.iloc[:,1]) +
                                (0.25 * sub_2.iloc[:,1]) +
                                (0.25 * sub_3.iloc[:,1]))


Predict submission 2024-04-17 22:02:06.940299
Blend with Top Kernals submissions 2024-04-17 22:02:07.656495


In [None]:
# Brutal approach to deal with predictions close to outer range
q1 = submission['SalePrice'].quantile(0.0045)
q2 = submission['SalePrice'].quantile(0.99)

submission['SalePrice'] = submission['SalePrice'].apply(lambda x: x if x > q1 else x*0.77)
submission['SalePrice'] = submission['SalePrice'].apply(lambda x: x if x < q2 else x*1.1)

submission.to_csv("new_submission.csv", index=False)
print('Save submission', datetime.now(),)

In [41]:
submission.to_csv("new_submission.csv", index=False)
print('Save submission', datetime.now(),)

Save submission 2024-04-17 22:02:11.439886


# Single XGBoost with Bayesian search

In [45]:
import optuna
import xgboost as xgb
import numpy as np
from sklearn.model_selection import KFold

# Define the objective function for Optuna
def objective(trial):
    # Define the hyperparameters to tune
    max_depth = trial.suggest_int('max_depth', 1, 20)
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True)
    n_estimators = trial.suggest_int('n_estimators', 10, 1000)
    gamma = trial.suggest_float('gamma', 0, 1)
    min_child_weight = trial.suggest_int('min_child_weight', 1, 10)
    subsample = trial.suggest_float('subsample', 0.1, 1.0)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.1, 1.0)
    reg_lambda = trial.suggest_float('reg_lambda', 0, 10)

    # Create an XGBRegressor with the selected hyperparameters
    clf = xgb.XGBRegressor(
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        gamma=gamma,
        min_child_weight=min_child_weight,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        reg_lambda=reg_lambda,
        random_state=42,
    )

    # Evaluate the regressor using k-fold cross-validation
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_index, val_index in kf.split(X):
        X_train_fold, X_val_fold = X[train_index], X[val_index]
        y_train_fold, y_val_fold = y[train_index], y[val_index]
        clf.fit(X_train_fold, y_train_fold)
        score = -np.mean((y_val_fold - clf.predict(X_val_fold))**2)
        scores.append(score)

    # Return the mean score of the folds
    return np.mean(scores)

# Create an Optuna study and run the optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

# Get the best hyperparameters
best_params = study.best_params
print('Best hyperparameters: ', best_params)

# Create a new XGBRegressor with the best hyperparameters
best_clf = xgb.XGBRegressor(**best_params, random_state=42)

# Fit the regressor to the training data
best_clf.fit(X, y)

# Make predictions on the test data
y_pred = best_clf.predict(X_sub)

[I 2024-04-18 01:16:16,629] A new study created in memory with name: no-name-bcc86be0-433d-4dca-bfaf-c77519d10a9e
[I 2024-04-18 01:16:18,014] Trial 0 finished with value: -0.09255804897900358 and parameters: {'max_depth': 19, 'learning_rate': 0.0026567882764080455, 'n_estimators': 177, 'gamma': 0.5268790291934162, 'min_child_weight': 10, 'subsample': 0.8846027411982053, 'colsample_bytree': 0.3155367735679996, 'reg_lambda': 6.736229903302608}. Best is trial 0 with value: -0.09255804897900358.
[I 2024-04-18 01:16:22,724] Trial 1 finished with value: -0.11932307284925436 and parameters: {'max_depth': 3, 'learning_rate': 0.00030594530435451047, 'n_estimators': 771, 'gamma': 0.2148795303

Best hyperparameters:  {'max_depth': 17, 'learning_rate': 0.01790681562035339, 'n_estimators': 660, 'gamma': 0.017999693663639908, 'min_child_weight': 3, 'subsample': 0.24620893048844877, 'colsample_bytree': 0.7468983652994997, 'reg_lambda': 7.9030084950602895}
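The manual fold loop inside `objective` computes the same quantity as `cross_val_score` with `scoring="neg_mean_squared_error"`. The sketch below shows this on synthetic data, with `Ridge` as a stand-in estimator so that it runs without XGBoost; the assumption is that any scikit-learn compatible regressor can be substituted.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

model = Ridge(alpha=1.0)   # stand-in estimator for illustration
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop, as in the objective function above
manual = []
for train_idx, val_idx in kf.split(X_toy):
    model.fit(X_toy[train_idx], y_toy[train_idx])
    manual.append(-np.mean((y_toy[val_idx] - model.predict(X_toy[val_idx])) ** 2))

# Same folds, same metric, one call
helper = cross_val_score(model, X_toy, y_toy, scoring="neg_mean_squared_error", cv=kf)
print(np.mean(manual), helper.mean())
```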


In [48]:
print('RMSLE score on train data:')
print(rmsle(y, best_clf.predict(X)))

RMSLE score on train data:
0.08127922082596462


In [59]:
print('Predict submission', datetime.now(),)
submission = pd.read_csv("sample_submission.csv")
submission.iloc[:,1] = np.floor(np.expm1(y_pred))


# let's up it by mixing with the top kernels

print('Blend with Top Kernals submissions', datetime.now(),)
sub_1 = pd.read_csv('House_Prices_submit.csv')
sub_2 = pd.read_csv('hybrid_solution.csv')
sub_3 = pd.read_csv('lasso_sol22_Median.csv')

submission.iloc[:,1] = np.floor((0.25 * np.floor(np.expm1(best_clf.predict(X_sub)))) +
                                (0.25 * sub_1.iloc[:,1]) +
                                (0.25 * sub_2.iloc[:,1]) +
                                (0.25 * sub_3.iloc[:,1]))


Predict submission 2024-04-18 01:27:31.097794
Blend with Top Kernals submissions 2024-04-18 01:27:31.100794


In [60]:
# Brutal approach to deal with predictions close to outer range
q1 = submission['SalePrice'].quantile(0.0045)
q2 = submission['SalePrice'].quantile(0.99)

submission['SalePrice'] = submission['SalePrice'].apply(lambda x: x if x > q1 else x*0.77)
submission['SalePrice'] = submission['SalePrice'].apply(lambda x: x if x < q2 else x*1.1)

submission.to_csv("new_submission.csv", index=False)
print('Save submission', datetime.now(),)

Save submission 2024-04-18 01:27:32.807527
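The two `apply` calls in the cell above can also be written without a Python-level lambda by using `Series.where`, which keeps a value where the condition holds and substitutes the alternative elsewhere. A sketch on made-up prices (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

prices = pd.Series(np.arange(1, 1001, dtype=float) * 1000.0)  # made-up prices

q_low = prices.quantile(0.0045)
q_high = prices.quantile(0.99)

adjusted = prices.copy()
adjusted = adjusted.where(adjusted > q_low, adjusted * 0.77)   # shrink the lowest tail
adjusted = adjusted.where(adjusted < q_high, adjusted * 1.1)   # stretch the highest tail
print(adjusted.head(), adjusted.tail())
```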


In [61]:
submission.to_csv("new_submission.csv", index=False)
print('Save submission', datetime.now(),)

Save submission 2024-04-18 01:27:33.525108


In [56]:
submission

     Id  SalePrice
0  1461   129038.0
1  1462   159492.0
2  1463   176073.0
3  1464   178815.0
4  1465   170264.0
5  1466   169906.0
6  1467   153851.0
7  1468   157974.0
8  1469   165112.0
9  1470   120306.0
